CSV(3) | Library Functions Manual | CSV(3) |
NAME¶
csv - CSV parser and writer library
SYNOPSIS¶
#include <libcsv/csv.h>
int csv_init(struct csv_parser *p, unsigned char options);
size_t csv_parse(struct csv_parser *p, const void *s, size_t len, void (*cb1)(void *, size_t, void *), void (*cb2)(int, void *), void *data);
int csv_fini(struct csv_parser *p, void (*cb1)(void *, size_t, void *), void (*cb2)(int, void *), void *data);
void csv_free(struct csv_parser *p);
unsigned char csv_get_delim(struct csv_parser *p);
unsigned char csv_get_quote(struct csv_parser *p);
void csv_set_space_func(struct csv_parser *p, int (*f)(unsigned char));
void csv_set_term_func(struct csv_parser *p, int (*f)(unsigned char));
int csv_get_opts(struct csv_parser *p);
int csv_set_opts(struct csv_parser *p, unsigned char options);
int csv_error(struct csv_parser *p);
char * csv_strerror(int error);
size_t csv_write(void *dest, size_t dest_size, const void *src, size_t src_size);
int csv_fwrite(FILE *fp, const void *src, size_t src_size);
size_t csv_write2(void *dest, size_t dest_size, const void *src, size_t src_size, unsigned char quote);
int csv_fwrite2(FILE *fp, const void *src, size_t src_size, unsigned char quote);
void csv_set_realloc_func(struct csv_parser *p, void *(*func)(void *, size_t));
void csv_set_free_func(struct csv_parser *p, void (*func)(void *));
void csv_set_blk_size(struct csv_parser *p, size_t size);
size_t csv_get_blk_size(struct csv_parser *p);
size_t csv_get_buffer_size(struct csv_parser *p);
DESCRIPTION¶
The CSV library provides a flexible, intuitive interface for parsing and writing csv data.
OVERVIEW¶
The idea behind parsing with libcsv is straightforward: you initialize a parser object with csv_init() and feed data to the parser over one or more calls to csv_parse(), providing callback functions that handle end-of-field and end-of-row events. csv_parse() parses the data provided, calling the user-defined callback functions as it reads fields and rows. When complete, csv_fini() is called to finish processing the current field and make a final call to the callback functions if necessary. csv_free() is then called to free the parser object. csv_error() and csv_strerror() provide information about errors encountered by the functions. csv_write() and csv_fwrite() provide a simple interface for converting raw data into CSV data and storing the result into a buffer or file respectively.
CSV is a binary format allowing the storage of arbitrary binary data; files opened for reading or writing CSV data should be opened in binary mode.
libcsv provides a default mode in which the parser will happily process any data as CSV without complaint; this is useful for parsing files which don't adhere to all the traditional rules. A strict mode is also supported, in which any violation of the imposed rules causes a parsing failure.
ROUTINES¶
PARSING DATA
- CSV_STRICT
- Enables strict mode.
- CSV_REPALL_NL
- Causes each instance of a carriage return or linefeed outside of a record to be reported.
- CSV_STRICT_FINI
- Causes unterminated quoted fields encountered in csv_fini() to cause a parsing error (see below).
- CSV_APPEND_NULL
- Causes all fields to be nul-terminated when provided to cb1. Introduced in 3.0.0.
- CSV_EMPTY_IS_NULL
- Causes NULL to be passed as the first argument to cb1 for empty, unquoted fields. Empty means consisting only of either spaces and tabs or the values defined by a custom function registered via csv_set_space_func(). Added in 3.0.3.
- p is a pointer to an initialized struct csv_parser.
- s is a pointer to the data to read in, such as a dynamically allocated region of memory containing data read in from a call to fread().
- len is the number of bytes of data to process.
- cb1 is a pointer to the callback function that will be called from csv_parse() after an entire field has been read. cb1 will be called with a pointer to the parsed data (which is NOT nul-terminated unless the CSV_APPEND_NULL option is set), the number of bytes in the data, and the pointer that was passed to csv_parse().
- cb2 is a pointer to the callback function that will be called when the end of a record is encountered. It will be called with the character that caused the record to end, cast to an unsigned char, or -1 if called from csv_fini(), and the data pointer that was passed to csv_parse() or csv_fini().
- data is a pointer to user-defined data that will be passed to the callback functions when invoked.
- cb1 and/or cb2 may be NULL in which case no function will be called for the associated actions. data may also be NULL but the callback functions must be prepared to handle receiving a null pointer.
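The callback contract described above can be sketched as a pair of plain C functions. This is only an illustration under stated assumptions: struct field_state, on_field and on_record are hypothetical names, not part of libcsv. The field callback copies by length because the data is not nul-terminated unless CSV_APPEND_NULL is set, and it tolerates a NULL field pointer (CSV_EMPTY_IS_NULL) as well as a NULL data pointer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user state; libcsv only ever sees it as the opaque data pointer. */
struct field_state {
    char   text[64];  /* nul-terminated copy of the most recent field */
    size_t fields;    /* fields seen so far */
    size_t rows;      /* records seen so far */
};

/* Matches the cb1 signature: (field bytes, byte count, user data). */
void on_field(void *s, size_t len, void *data)
{
    struct field_state *st = data;

    if (st == NULL)                   /* data may be NULL */
        return;
    if (s == NULL)                    /* possible with CSV_EMPTY_IS_NULL */
        len = 0;
    if (len >= sizeof st->text)       /* truncate to fit the local buffer */
        len = sizeof st->text - 1;
    if (len > 0)
        memcpy(st->text, s, len);     /* copy by length, not by strlen */
    st->text[len] = '\0';
    st->fields++;
}

/* Matches the cb2 signature: (terminating character or -1, user data). */
void on_record(int c, void *data)
{
    struct field_state *st = data;

    (void)c;                          /* the terminator is not needed here */
    if (st != NULL)
        st->rows++;
}
```

Functions with these signatures could then be passed as cb1 and cb2 to csv_parse() and csv_fini(), with a pointer to the state structure as the data argument.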
- CSV_EPARSE A parse error has occurred while in strict mode
- CSV_ENOMEM There was not enough memory while attempting to increase the entry buffer for the current field
- CSV_ETOOBIG Continuing to process the current field would require a buffer of more than SIZE_MAX bytes
THE CSV FORMAT¶
Although quite prevalent, there is no standard for the CSV format. There are, however, a set of traditional conventions used by many applications. libcsv follows the conventions described at http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm which seem to reflect the most common usage of the format, namely:
- Fields are separated with commas.
- Rows are delimited by newline sequences (see below).
- Fields may be surrounded with quotes.
- Fields that contain comma, quote, or newline characters MUST be quoted.
- Each instance of a quote character must be escaped with an immediately preceding quote character.
- Leading and trailing spaces and tabs are removed from non-quoted fields.
- The final line need not contain a newline sequence.
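The quoting and escaping conventions listed above can be illustrated with a small standalone function. This is a sketch, not libcsv's implementation: quote_field is a hypothetical helper that always surrounds the field with double quotes and escapes each embedded quote with a preceding quote. It returns the number of bytes the quoted field needs, which may exceed dest_size, in which case the output is truncated.

```c
#include <stddef.h>

/* Hypothetical helper (not part of libcsv): writes src as a quoted CSV field
   into dest, which is not nul-terminated. Returns the number of bytes the
   full quoted field requires, even if fewer were actually stored. */
size_t quote_field(char *dest, size_t dest_size, const char *src, size_t src_size)
{
    size_t n = 0;  /* bytes required so far */
    size_t i;

    if (n < dest_size)
        dest[n] = '"';               /* opening quote */
    n++;
    for (i = 0; i < src_size; i++) {
        if (src[i] == '"') {         /* escape a quote with a preceding quote */
            if (n < dest_size)
                dest[n] = '"';
            n++;
        }
        if (n < dest_size)
            dest[n] = src[i];
        n++;
    }
    if (n < dest_size)
        dest[n] = '"';               /* closing quote */
    n++;
    return n;
}
```

Calling the function with a dest_size of 0 stores nothing and only reports the required size, which a caller can use to allocate a large enough buffer before calling it again.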
- "Each line should contain the same number of fields throughout the file"
- libcsv doesn't care if every record contains a different number of fields; such a restriction could easily be enforced by the application itself if desired.
- "Spaces are considered part of a field and should not be ignored"
- Leading and trailing spaces that are part of non-quoted fields are ignored as this is by far the most common behavior and expected by many applications. abc , def is considered equivalent to: "abc", "def"
- "The last field in the record must not be followed by a comma"
- The meaning of this statement is not clear but if the last character of a record is a comma, libcsv will interpret that as a final empty field, i.e.: "abc", "def", will be interpreted as 3 fields, equivalent to: "abc", "def", ""
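Two of the behaviors above, trimming spaces and tabs around non-quoted fields and treating a trailing comma as a final empty field, can be sketched in a few lines. The helpers trim_field and count_fields are hypothetical and deliberately simplified: count_fields assumes a single non-empty record with no commas inside quoted fields and no newline.

```c
#include <stddef.h>

/* Hypothetical helper: narrows [*start, *start + *len) so that leading and
   trailing spaces and tabs are excluded. */
void trim_field(const char **start, size_t *len)
{
    while (*len > 0 && (**start == ' ' || **start == '\t')) {
        (*start)++;
        (*len)--;
    }
    while (*len > 0 &&
           ((*start)[*len - 1] == ' ' || (*start)[*len - 1] == '\t'))
        (*len)--;
}

/* Hypothetical helper: counts the fields of one non-empty record that has no
   commas inside quotes. Every comma starts a new field, so a trailing comma
   contributes a final empty field. */
size_t count_fields(const char *rec)
{
    size_t n = 1;

    for (; *rec != '\0'; rec++)
        if (*rec == ',')
            n++;
    return n;
}
```

With these definitions, a record ending in a comma is counted as having one more field than it has non-empty values, which matches the interpretation described in the last item.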
PARSING MALFORMED DATA¶
libcsv should correctly parse any CSV data that conforms to the rules discussed above. By default, however, libcsv will also attempt to parse malformed CSV data such as data containing unescaped quotes or quotes within non-quoted fields. For example:
a"c, "d"f"
would be parsed equivalently to the correct form:
"a""c", "d""f"
This is often desirable as there are some applications that do not adhere to the specifications previously discussed. However, there are instances where malformed CSV data is ambiguous, namely when a comma or newline is the next non-space character following a quote such as:
"Sally said "Hello", Wally said "Goodbye""
This could either be parsed as a single field containing the data:
Sally said "Hello", Wally said "Goodbye"
or as 2 separate fields:
Sally said "Hello
and
Wally said "Goodbye""
Since the data is malformed, there is no way to know if the quote before the comma is meant to be a literal quote or if it signifies the end of the field. This is of course not an issue for properly formed data as all quotes must be escaped. libcsv will parse this example as 2 separate fields.
libcsv provides a strict mode that will return with a parse error if a quote is seen inside a non-quoted field or if a non-escaped quote is seen whose next non-space character isn't a comma or newline sequence.
PARSER DETAILS¶
A field is considered quoted if the first non-space character for a new field is a quote. If a quote is encountered in a quoted field and the next non-space character is a comma, the field ends at the closing quote and the field data is submitted when the comma is encountered. If the next non-space character after a quote is a newline character, the row has ended: the field data is submitted and the end of row is signalled (via the appropriate callback function). If two quotes are immediately adjacent, the first one is interpreted as escaping the second one and one quote is written to the field buffer. If the next non-space character following a quote is anything else, the quote is interpreted as a non-escaped literal quote and it and what follows are written to the field buffer; this would cause a parse error in strict mode.
Example 1
"abc"""
Parses as: abc"
The first quote marks the field as quoted, the second quote escapes the following quote and the last quote ends the field. This is valid in both strict and non-strict modes.
Example 2
"ab"c"
Parses as: ab"c
The first quote marks the field as quoted, the second quote is taken as a literal quote since the next non-space character is not a comma or newline and the quote is not escaped. The last quote ends the field (assuming there is a newline character following). A parse error would result upon seeing the character c in strict mode.
Example 3
"abc" "
Parses as: abc"
In this case, since the next non-space character following the second quote is not a comma or newline character, a literal quote is written, the space character after it is part of the field, and the last quote terminates the field. This demonstrates the fact that a quote must immediately precede another quote to escape it. This would be a strict-mode violation as all quotes are required to be escaped.
If the field is not quoted, any quote character is taken as part of the field data, any comma terminates the field, and any newline character terminates the field and the record.
Example 4
ab""c
Parses as: ab""c
Quotes are not considered special in non-quoted fields. This would be a strict mode violation since quotes may not exist in non-quoted fields in strict mode.
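The rules in this section can be restated as a small state machine. The following is a simplified sketch for a single field in non-strict mode, written for illustration only; it is not libcsv's code, and parse_one_field is a hypothetical name. It assumes the default comma delimiter, double-quote character, space/tab as spaces, and CR/LF as newline characters.

```c
#include <stddef.h>

/* Hypothetical sketch: parses ONE field from in, stopping at an unquoted
   comma, a newline, or the end of the string. The field is written to out,
   which must hold at least strlen(in) + 1 bytes; its length is returned. */
size_t parse_one_field(const char *in, char *out)
{
    size_t n = 0;              /* bytes written to out */
    size_t spaces = 0;         /* trailing spaces currently stored in out */
    int quoted = 0;            /* field began with a quote */
    int might_have_ended = 0;  /* last significant char was a quote in a quoted field */

    while (*in == ' ' || *in == '\t')  /* leading spaces are skipped */
        in++;
    if (*in == '"') {
        quoted = 1;
        in++;
    }
    for (; *in != '\0'; in++) {
        char c = *in;
        int ends = (c == ',' || c == '\n' || c == '\r');
        int space = (c == ' ' || c == '\t');

        if (might_have_ended) {
            if (ends)
                break;                     /* the quote really closed the field */
            if (space) {
                out[n++] = c;              /* kept for now, dropped if the field ends */
                spaces++;
            } else if (c == '"' && spaces == 0) {
                might_have_ended = 0;      /* "" pair: one quote is already stored */
            } else {
                out[n++] = c;              /* earlier quote was a literal quote */
                spaces = 0;
                might_have_ended = (c == '"');
            }
        } else if (c == '"' && quoted) {
            out[n++] = c;                  /* may be a closing or an escaping quote */
            spaces = 0;
            might_have_ended = 1;
        } else if (ends && !quoted) {
            break;                         /* end of a non-quoted field */
        } else {
            out[n++] = c;
            spaces = (space && !quoted) ? spaces + 1 : 0;
        }
    }
    if (might_have_ended)
        n -= spaces + 1;   /* drop the closing quote and any spaces after it */
    else if (!quoted)
        n -= spaces;       /* drop trailing spaces of a non-quoted field */
    out[n] = '\0';
    return n;
}
```

For instance, passing the input of Example 1 produces abc" and the input of Example 4 is copied through unchanged. A strict-mode variant would report an error at the points where this sketch stores a literal quote.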
EXAMPLES¶
The following example prints the number of fields and rows in a file. This is a simplified version of the csvinfo program provided in the examples directory. Error checking not related to libcsv has been removed for clarity; the csvinfo program also provides an option for enabling strict mode and handles multiple files.
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include "libcsv/csv.h"

struct counts {
  long unsigned fields;
  long unsigned rows;
};

/* Field callback: called once for each parsed field */
void cb1 (void *s, size_t len, void *data) {
  ((struct counts *)data)->fields++;
}

/* Record callback: called once for each parsed record */
void cb2 (int c, void *data) {
  ((struct counts *)data)->rows++;
}

int main (int argc, char *argv[]) {
  FILE *fp;
  struct csv_parser p;
  char buf[1024];
  size_t bytes_read;
  struct counts c = {0, 0};

  if (csv_init(&p, 0) != 0) exit(EXIT_FAILURE);

  /* CSV data may be binary, so open the file in binary mode */
  fp = fopen(argv[1], "rb");
  if (!fp) exit(EXIT_FAILURE);

  /* Feed the file to the parser one block at a time */
  while ((bytes_read=fread(buf, 1, 1024, fp)) > 0)
    if (csv_parse(&p, buf, bytes_read, cb1, cb2, &c) != bytes_read) {
      fprintf(stderr, "Error while parsing file: %s\n", csv_strerror(csv_error(&p)) );
      exit(EXIT_FAILURE);
    }

  /* Finish processing the current field and record, if any */
  csv_fini(&p, cb1, cb2, &c);
  fclose(fp);
  printf("%lu fields, %lu rows\n", c.fields, c.rows);
  csv_free(&p);
  exit(EXIT_SUCCESS);
}
See the examples directory for several complete example programs.
AUTHOR¶
Written by Robert Gamble.
BUGS¶
Please send questions, comments, bugs, etc. to: rgamble@users.sourceforge.net
9 January 2013 |