Comma-separated values

Revision as of 14:28, 29 May 2025 by imported>Paul2520 (Rescuing 10 sources and tagging 0 as dead.) #IABot (v2.0.9.5)

Comma-separated values (CSV) is a text file format that uses commas to separate values, and newlines to separate records. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file. If the field delimiter itself may appear within a field, fields can be surrounded with quotation marks.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

The CSV file format is one type of delimiter-separated file format.<ref>Template:Cite book</ref> Delimiters frequently used include the comma, tab, space, and semicolon. Delimiter-separated files are often given a ".csv" extension even when the field separator is not a comma. Many applications or libraries that consume or produce CSV files have options to specify an alternative delimiter.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

The lack of adherence to the CSV standard RFC 4180 necessitates the support for a variety of CSV formats in data input software. Despite this drawback, CSV remains widespread in data applications and is widely supported by a variety of software, including common spreadsheet applications such as Microsoft Excel.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> Benefits cited in favor of CSV include human readability and the simplicity of the format.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

Applications

CSV is a common data exchange format that is widely supported by consumer, business, and scientific applications. Among its most common uses is moving tabular data<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref><ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> between programs that natively operate on incompatible (often proprietary or undocumented) formats.<ref name="rfc4180">Template:Cite IETF</ref> For example, a user may need to transfer information from a database program that stores data in a proprietary format, to a spreadsheet that uses a completely different format. Most database programs can export data as CSV. Most spreadsheet programs can read CSV data, allowing CSV to be used as an intermediate format when transferring data from a database to a spreadsheet. Every major ecommerce platform provides support for exporting data as a CSV file.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

CSV is also used for storing data. Common data science tools such as Pandas include the option to export data to CSV for long-term storage.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> Benefits of CSV for data storage include simplicity, which makes parsing and creating CSV files easy to implement and fast compared to other data formats; human readability, which makes editing or fixing data simpler;<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> and high compressibility, which leads to smaller data files.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> On the other hand, CSV does not support more complex data relations and makes no distinction between null and empty values; in applications where these features are needed, other formats are preferred.
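The lack of a distinction between null and empty values can be seen in a minimal sketch using Python's standard csv module. The helper name roundtrip and the sample row are illustrative assumptions, not part of any specification.

```python
import csv
import io

def roundtrip(row):
    """Write one row with csv.writer, then read it back with csv.reader."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    text = buf.getvalue()
    return text, next(csv.reader(io.StringIO(text)))

# None (a "null") and "" (an empty string) are written identically,
# so the reader cannot tell them apart afterwards.
text, fields = roundtrip(["alice", None, ""])
print(repr(text))  # 'alice,,\r\n'
print(fields)      # ['alice', '', '']
```

Applications that need to preserve the difference usually agree on a sentinel value or carry the information in separate metadata.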

More than 200 local, regional, and national data portals, such as those of the UK government and the European Commission, use CSV files with standardized data catalogs.<ref>Template:Cite book</ref>

Specification

RFC 4180 proposes a specification for the CSV format; however, actual practice often does not follow the RFC and the term "CSV" might refer to any file that:<ref name="rfc4180"/><ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

  1. is plain text using a character encoding such as ASCII, various Unicode character encodings (e.g. UTF-8), EBCDIC, or Shift JIS,
  2. consists of records (typically one record per line),
  3. with the records divided into fields separated by a comma,
  4. where every record has the same sequence of fields.

Within these general constraints, many variations are in use. Therefore, without additional information (such as whether RFC 4180 is honored), a file claimed simply to be in "CSV" format is not fully specified.

History

Comma-separated values is a data format that predates personal computers by more than a decade: the IBM Fortran (level H extended) compiler under OS/360 supported CSV in 1972.<ref>Template:Citation</ref> List-directed ("free form") input/output was defined in FORTRAN 77, approved in 1978. List-directed input used commas or spaces for delimiters, so unquoted character strings could not contain commas or spaces.<ref>Template:Citation</ref>

The term "comma-separated value" and the "CSV" abbreviation were in use by 1983.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> The manual for the Osborne Executive computer, which bundled the SuperCalc spreadsheet, documents the CSV quoting convention that allows strings to contain embedded commas, but the manual does not specify a convention for embedding quotation marks within quoted strings.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

Comma-separated value lists were easier to type (for example into punched cards) than fixed-column-aligned data, and they were less prone to producing incorrect results if a value was punched one column off from its intended location.

Comma-separated files are used for the interchange of database information between machines of two different architectures. The plain-text character of CSV files largely avoids incompatibilities such as byte-order and word size. The files are largely human-readable, so it is easier to deal with them in the absence of perfect documentation or communication.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

The main standardization initiative (transforming the "de facto fuzzy definition" into a more precise, de jure one) was in 2005, with RFC 4180, defining CSV as a MIME Content Type.<ref>Template:Cite RFC</ref> Later, in 2013, some of RFC 4180's deficiencies were tackled by a W3C recommendation.<ref>See sparql11-results-csv-tsv, the first W3C recommendation scoped in CSV and filling some of RFC 4180's deficiencies.</ref>

In 2014 the IETF published RFC 7111, describing the application of URI fragments to CSV documents. RFC 7111 specifies how row, column, and cell ranges can be selected from a CSV document using position indexes.<ref>Template:Cite RFC</ref>

In 2015 the W3C, in an attempt to enhance CSV with formal semantics, published the first drafts of recommendations for CSV metadata standards, which became recommendations in December of the same year.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

General functionality

CSV formats are best used to represent sets or sequences of records in which each record has an identical list of fields. This corresponds to a single relation in a relational database, or to data (though not calculations) in a typical spreadsheet.

The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms.

CSV is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used; for example, the use of a "Sep=^" row as the first row in the *.csv file will cause Excel to open the file expecting caret "^" to be the separator instead of comma ","). Simple CSV implementations may prohibit field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit them, often by requiring " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or less commonly, newlines). Embedded double quote characters may then be represented by a pair of consecutive double quotes,<ref>*Template:Citation</ref> or by prefixing a double quote with an escape character such as a backslash (for example in Sybase Central).
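The two conventions for embedded double quotes described above can be sketched with Python's standard csv module. The helper name to_csv_line and the sample values are assumptions made for illustration.

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one record, quoting only the fields that need it."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

original = ["plain", "has, comma", 'says "hi"']

# Convention 1: an embedded double quote is written as two double quotes.
line = to_csv_line(original)
print(line, end="")  # plain,"has, comma","says ""hi"""
parsed = next(csv.reader(io.StringIO(line)))

# Convention 2: an embedded double quote is prefixed with a backslash.
escaped = 'plain,"has, comma","says \\"hi\\""\n'
parsed_escaped = next(
    csv.reader(io.StringIO(escaped), doublequote=False, escapechar="\\")
)
```

Both parses recover the same three fields, but only if the reader is told which convention the file uses; a parser configured for one convention will generally misread the other.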

CSV formats are not limited to a particular character set.<ref name="rfc4180"/> They work just as well with Unicode character sets (such as UTF-8 or UTF-16) as with ASCII (although particular programs that support CSV may have their own limitations). CSV files normally will even survive naïve translation from one character set to another (unlike nearly all proprietary data formats). CSV does not, however, provide any way to indicate what character set is in use, so that must be communicated separately, or determined at the receiving end (if possible).

Databases that include multiple relations cannot be exported as a single CSV file.[citation needed] Similarly, CSV cannot naturally represent hierarchical or object-oriented data. This is because every CSV record is expected to have the same structure. CSV is therefore rarely appropriate for documents created with HTML, XML, or other markup or word-processing technologies.

Statistical databases in various fields often have a generally relation-like structure, but with some repeatable groups of fields. For example, health databases such as the Demographic and Health Survey typically repeat some questions for each child of a given parent (perhaps up to a fixed maximum number of children). Statistical analysis systems often include utilities that can "rotate" such data; for example, a "parent" record that includes information about five children can be split into five separate records, each containing (a) the information on one child, and (b) a copy of all the non-child-specific information. CSV can represent either the "vertical" or "horizontal" form of such data.
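The "rotation" described above can be sketched as follows. The column names (parent_id, region, child1_name, and so on) and the function name rotate are hypothetical and chosen only for illustration; real survey files use their own layouts.

```python
import csv
import io

# "Horizontal" form: one record per parent, with repeated child fields.
WIDE = """parent_id,region,child1_name,child1_age,child2_name,child2_age
17,North,Ana,4,Ben,7
18,South,Cy,2,,
"""

def rotate(wide_text, max_children=2):
    """Return the "vertical" form: one record per child, each carrying
    a copy of the non-child-specific fields."""
    rows = []
    for rec in csv.DictReader(io.StringIO(wide_text)):
        shared = {"parent_id": rec["parent_id"], "region": rec["region"]}
        for i in range(1, max_children + 1):
            name = rec[f"child{i}_name"]
            if name == "":  # unused child slot
                continue
            rows.append(
                {**shared, "child_name": name, "child_age": rec[f"child{i}_age"]}
            )
    return rows

long_rows = rotate(WIDE)
for row in long_rows:
    print(row)
```

Either form can be written back out as CSV; the vertical form simply trades repeated columns for repeated parent values.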

In a relational database, similar issues are readily handled by creating a separate relation for each such group, and connecting "child" records to the related "parent" records using a foreign key (such as an ID number or name for the parent). In markup languages such as XML, such groups are typically enclosed within a parent element and repeated as necessary (for example, multiple <child> nodes within a single <parent> node). With CSV there is no widely accepted single-file solution.

Standardization

The name "CSV" indicates the use of the comma to separate data fields. Nevertheless, the term "CSV" is widely used to refer to a large family of formats that differ in many ways. Some implementations allow or require single or double quotation marks around some or all fields; and some reserve the first record as a header containing a list of field names. The character set being used is undefined: some applications require a Unicode byte order mark (BOM) to enforce Unicode interpretation (sometimes even a UTF-8 BOM).<ref name="rfc4180"/> Files that use the tab character instead of comma can be more precisely referred to as "TSV" for tab-separated values.
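The effect of a UTF-8 byte order mark on a consumer that does not expect it can be illustrated with a short Python sketch. The sample data is made up; utf-8-sig is the codec name Python uses for UTF-8 with a BOM.

```python
import csv
import io

data = "name,city\nZoë,Kraków\n"

# Encoding with "utf-8-sig" prepends the three-byte UTF-8 BOM (EF BB BF).
with_bom = data.encode("utf-8-sig")

# A reader that decodes as plain UTF-8 keeps the BOM as part of the
# first field name; decoding as "utf-8-sig" strips it.
naive_header = next(csv.reader(io.StringIO(with_bom.decode("utf-8"))))
aware_header = next(csv.reader(io.StringIO(with_bom.decode("utf-8-sig"))))

print(naive_header)  # ['\ufeffname', 'city']
print(aware_header)  # ['name', 'city']
```

Because the file itself does not declare its encoding, the producer and consumer have to agree on whether a BOM is present.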

Other implementation differences include the handling of more commonplace field separators (such as space or semicolon) and newline characters inside text fields. One more subtlety is the interpretation of a blank line: it can equally be the result of writing a record of zero fields, or a record of one field of zero length; thus decoding it is ambiguous.
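How one particular implementation resolves the blank-line ambiguity can be seen in the sketch below, which uses Python's standard csv module. Other libraries may make different choices, so this is an illustration of one behaviour rather than a general rule.

```python
import csv
import io

# Reading: this parser reports a blank line as a record with zero fields.
rows = list(csv.reader(io.StringIO("a,b\n\nc,d\n")))
print(rows)  # [['a', 'b'], [], ['c', 'd']]

def written(row):
    """Return the text csv.writer produces for a single record."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return buf.getvalue()

# Writing: a record of zero fields becomes a blank line, while a record
# of one empty field is quoted so that it stays distinguishable.
zero_fields = written([])
one_empty_field = written([""])
print(repr(zero_fields))      # '\n'
print(repr(one_empty_field))  # '""\n'
```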

RFC 4180 and MIME standards

The 2005 technical standard RFC 4180 formalizes the CSV file format and defines the MIME type "text/csv" for the handling of text-based fields. However, the interpretation of the text of each field is still application-specific. Files that follow the RFC 4180 standard can simplify CSV exchange and should be widely portable. Among its requirements:

  • MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
  • An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
  • Each record should contain the same number of comma-separated fields.
  • Any field may be quoted (with double quotes).
  • Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly.)
  • If double-quotes are used to enclose fields, then a double-quote in a field must be represented by two double-quote characters.

The format can be processed by most programs that claim to read CSV files. The exceptions are (a) programs may not support line-breaks within quoted fields, (b) programs may confuse the optional header with data or interpret the first data line as an optional header, and (c) double-quotes in a field may not be parsed correctly automatically.
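The requirements listed above can be approximated with Python's standard csv module, as in the following sketch. The function name write_rfc4180 and the sample records are assumptions; the sketch checks only the field-count rule and relies on the library for quoting.

```python
import csv
import io

def write_rfc4180(header, records):
    """Serialize records with CRLF line endings, comma separators,
    and doubled double quotes inside quoted fields."""
    width = len(header)
    for rec in records:
        if len(rec) != width:
            raise ValueError("every record must have the same number of fields")
    buf = io.StringIO(newline="")  # keep CRLF exactly as written
    writer = csv.writer(
        buf,
        delimiter=",",
        quotechar='"',
        doublequote=True,
        lineterminator="\r\n",
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
    )
    writer.writerow(header)
    writer.writerows(records)
    return buf.getvalue()

text = write_rfc4180(
    ["id", "note"],
    [["1", 'say "hi"'], ["2", "two\r\nlines"]],
)
print(repr(text))
```

Fields containing a line break, a double quote, or a comma are quoted automatically, and an uneven record is rejected before anything is written.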

OKF frictionless tabular data package

In 2011 the Open Knowledge Foundation (OKF) and various partners created a data protocols working group, which later evolved into the Frictionless Data initiative. One of the main formats they released was the Tabular Data Package. The Tabular Data Package was heavily based on CSV, using it as the main data transport format and adding basic type and schema metadata (CSV lacks any type information to distinguish the string "1" from the number 1).<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>
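The role of external type metadata can be illustrated with a small Python sketch. The SCHEMA dictionary below is a hypothetical stand-in for schema metadata and does not follow the Tabular Data Package's actual descriptor format.

```python
import csv
import io

# Hypothetical schema: maps each column name to a Python type.
SCHEMA = {"id": int, "price": float, "label": str}

def read_typed(text, schema):
    """Parse CSV text and convert each field using the given schema."""
    return [
        {name: schema[name](value) for name, value in rec.items()}
        for rec in csv.DictReader(io.StringIO(text))
    ]

# Without the schema, both "1" values would be read back as strings.
rows = read_typed("id,price,label\n1,3000.00,1\n", SCHEMA)
print(rows)  # [{'id': 1, 'price': 3000.0, 'label': '1'}]
```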

The Frictionless Data Initiative has also provided a standard CSV Dialect Description Format for describing different dialects of CSV, for example specifying the field separator or quoting rules.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

W3C tabular data standard

In 2013 the W3C "CSV on the Web" working group began to specify technologies providing higher interoperability for web applications using CSV or similar formats.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> The working group completed its work in February 2016 and was officially closed in March 2016 with the release of a set of documents and W3C recommendations<ref>CSV on the Web Repository (on GitHub)</ref> for modeling "Tabular Data",<ref>Model for Tabular Data and Metadata on the Web Template:Webarchive (W3C Recommendation)</ref> and enhancing CSV with metadata and semantics.

While the well-formedness of CSV data can readily be checked, testing validity and canonical form is less well developed than in more precise data formats, such as XML and SQL, which offer richer types and rules-based validation.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

Basic rules

Many informal documents exist that describe "CSV" formats. IETF RFC 4180 (summarized above) defines the format for the "text/csv" MIME type registered with the IANA.

Rules typical of these and other "CSV" specifications and implementations are as follows: Template:Unordered list

Example

Year | Make  | Model                                  | Description            | Price
1997 | Ford  | E350                                   | ac, abs, moon          | 3000.00
1999 | Chevy | Venture "Extended Edition"             |                        | 4900.00
1999 | Chevy | Venture "Extended Edition, Very Large" |                        | 5000.00
1996 | Jeep  | Grand Cherokee                         | MUST SELL!             | 4799.00
     |       |                                        | air, moon roof, loaded |

The above table of data may be represented in CSV format as follows:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
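As a check on the example, the CSV text above can be parsed with Python's standard csv module. This is a minimal sketch; the variable names are illustrative.

```python
import csv
import io

TEXT = '''Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''

# newline="" leaves line endings untouched so that the line break
# embedded in the last record's Description field is preserved.
rows = list(csv.reader(io.StringIO(TEXT, newline="")))

# Six physical lines of data yield a header plus four records,
# each with five fields.
for row in rows:
    print(len(row), row)
```

The doubled quotes collapse to single quotes in the Model field, the empty quoted fields become empty strings, and the last Description field keeps its embedded line break.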

Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

Year,Make,Model,Length
1997,Ford,E350,2.35
2000,Mercury,Cougar,2.38

Example of an analogous European CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

Year;Make;Model;Length
1997;Ford;E350;2,35
2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant.<ref>RFC 4180 states, "Within the header and each record, there may be one or more fields, separated by commas."</ref> Compliance could be achieved by the use of a comma instead of a semicolon as a separator and by quoting all numbers that have a decimal mark.
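The conversion described above can be sketched in Python. The function names read_semicolon and to_comma_separated are assumptions, and the sketch treats the Length column as the only numeric column that uses a decimal comma.

```python
import csv
import io
from decimal import Decimal

EURO = "Year;Make;Model;Length\n1997;Ford;E350;2,35\n2000;Mercury;Cougar;2,38\n"

def read_semicolon(text):
    """Parse semicolon-separated text and convert decimal commas."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter=";"):
        rec["Length"] = Decimal(rec["Length"].replace(",", "."))
        rows.append(rec)
    return rows

def to_comma_separated(text):
    """Rewrite with comma separators; values that contain a comma
    are quoted automatically by the writer."""
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\r\n")
    writer.writerows(csv.reader(io.StringIO(text), delimiter=";"))
    return out.getvalue()

records = read_semicolon(EURO)
converted = to_comma_separated(EURO)
print(converted)
```

The converted text keeps the decimal comma inside quoted fields, for example 1997,Ford,E350,"2,35", so a consumer still has to know the locale in order to interpret the number.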

Application support

Some applications use CSV as a data interchange format to enhance their interoperability, exporting and importing CSV. Others use CSV as an internal format.

As a data interchange format, the CSV file format is supported by almost all spreadsheets and database management systems:

  • Spreadsheets including Apple Numbers, LibreOffice Calc, and Apache OpenOffice Calc. Microsoft Excel also supports a dialect of CSV with restrictions in comparison to other spreadsheet software (e.g., as of the time of writing, Excel could not export CSV files in the commonly used UTF-8 character encoding, and the separator is not required to be a comma). The LibreOffice Calc CSV importer is actually a more generic delimited text importer, supporting multiple separators at the same time as well as field trimming.
  • Various relational databases support saving query results to a CSV file. PostgreSQL provides the COPY command, which allows for both saving and loading data to and from a file. The command COPY (SELECT * FROM articles) TO '/home/wikipedia/file.csv' (FORMAT csv) saves the content of the table articles to a file called /home/wikipedia/file.csv.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

  • Many utility programs on Unix-style systems (such as cut, paste, join, sort, uniq, awk) can split files on a comma delimiter, and can therefore process simple CSV files. However, this method does not correctly handle commas or newlines within quoted strings, so it is better to use CSV-aware tools such as csvkit or Miller.
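The limitation of splitting on every comma can be shown with a short Python sketch that compares a naive split with a CSV-aware parser. The sample line is taken from the example earlier in this article.

```python
import csv

line = '1997,Ford,E350,"ac, abs, moon",3000.00'

# Splitting on every comma, as a simple field-cutting tool would,
# breaks the quoted Description field into three pieces.
naive = line.split(",")

# A CSV-aware parser keeps the quoted field intact.
proper = next(csv.reader([line]))

print(len(naive), naive)
print(len(proper), proper)  # 5 ['1997', 'Ford', 'E350', 'ac, abs, moon', '3000.00']
```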

As a (main or optional) internal representation, which may be native or foreign. This differs from use as an interchange format ("export/import only") because it is not necessary to create a copy in another format:

  • Some spreadsheets, including LibreOffice Calc, offer this option without forcing the user to adopt another format.
  • Some relational databases, when using standard SQL, offer foreign-data wrappers (FDW). For example, PostgreSQL offers the CREATE FOREIGN TABLE<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> and CREATE EXTENSION file_fdw<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref> commands to configure any variant of CSV.

  • Databases like Apache Hive offer the option to express CSV or .csv.gz as an internal table format.
  • The Emacs editor can operate on CSV files using csv-nav mode.<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

CSV format is supported by libraries available for many programming languages. Most provide some way to specify the field delimiter, decimal separator, character encoding, quoting conventions, date format, etc.

Software and row limits

Programs that work with CSV may have limits on the maximum number of rows CSV files can have. Below is a list of common software and its limitations:<ref>{{#invoke:citation/CS1|citation |CitationClass=web }}</ref>

  • Microsoft Excel: 1,048,576 row limit;
  • Microsoft PowerShell: no row or cell limit (limited by memory);
  • Apple Numbers: 1,000,000 row limit;
  • Google Sheets: 10,000,000 cell limit (the product of columns and rows);
  • OpenOffice and LibreOffice: 1,048,576 row limit;
  • Sourcetable:<ref>large data spreadsheet Sourcetable Inc., 2024. Retrieved 2024-11-14.</ref> no row limit (spreadsheet-database hybrid);
  • Text editors (such as WordPad, TextEdit, Vim, etc.): no row or cell limit;
  • Databases (COPY command and FDW): no row or cell limit.

References

Template:Reflist

Further reading

  • {{#invoke:citation/CS1|citation |CitationClass=web }} (Has file descriptions of delimited ASCII (.DEL) (including comma- and semicolon-separated) and non-delimited ASCII (.ASC) files for data transfer.)
