language-agnostic - the - split string on commas but ignore commas within double-quotes




Split a string ignoring quoted sections (8)

Given a string like this:

a,"string, with",various,"values, and some",quoted

What is a good algorithm to split this based on commas while ignoring the commas inside the quoted sections?

The output should be an array:

[ "a", "string, with", "various", "values, and some", "quoted" ]


Python:

import csv
reader = csv.reader(open("some.csv"))
for row in reader:
    print row

A very complrehesive library can be found here: FileHelpers


I use itertools (especially cycle, repeat, chain) to make python behave more like R and in other functional / vector applications. Often this lets me avoid the overhead and complication of Numpy.

# in R, shorter iterables are automatically cycled
# and all functions "apply" in a "map"-like way over lists
> 0:10 + 0:2
 [1]  0  2  4  3  5  7  6  8 10  9 11

Python #Normal python In [1]: range(10) + range(3) Out[1]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]

## this code is terrible, but it demos the idea.
from itertools import cycle
def addR(L1,L2):
    n = max( len(L1), len(L2))
    out = [None,]*n
    gen1,gen2 = cycle(L1), cycle(L2)
    ii = 0
    while ii < n:
        out[ii] = gen1.next() + gen2.next()
        ii += 1
    return out

In [21]: addR(range(10), range(3))
Out[21]: [0, 2, 4, 3, 5, 7, 6, 8, 10, 9]

If my language of choice didn't offer a way to do this without thinking then I would initially consider two options as the easy way out:

  1. Pre-parse and replace the commas within the string with another control character then split them, followed by a post-parse on the array to replace the control character used previously with the commas.

  2. Alternatively split them on the commas then post-parse the resulting array into another array checking for leading quotes on each array entry and concatenating the entries until I reached a terminating quote.

These are hacks however, and if this is a pure 'mental' exercise then I suspect they will prove unhelpful. If this is a real world problem then it would help to know the language so that we could offer some specific advice.


Microsoft's TextFieldParser is stable and follows RFC 4180 for CSV files. Don't be put off by the Microsoft.VisualBasic namespace; it's a standard component in the .NET Framework, just add a reference to the global Microsoft.VisualBasic assembly.

If you're compiling for Windows (as opposed to Mono) and don't anticipate having to parse "broken" (non-RFC-compliant) CSV files, then this would be the obvious choice, as it's free, unrestricted, stable, and actively supported, most of which cannot be said for FileHelpers.

See also: How to: Read From Comma-Delimited Text Files in Visual Basic for a VB code example.


Oft overlooked modules, uses and tricks:

collections.defaultdict(): for when you want missing keys in a dict to have a default value.

functools.wraps(): for writing decorators that play nicely with introspection.

posixpath: the os.path module for POSIX systems. You can use it for manipulating POSIX paths (including URI elements) even on Windows and other non-POSIX systems.

ntpath: the os.path module for Windows; usable for manipulation of Windows paths on non-Windows systems.

(also: macpath, for MacOS 9 and earlier, os2emxpath for OS/2 EMX, but I'm not sure if anyone still cares.)

pprint: more structured printing of the repr() of containers makes debugging much easier.

imp: all the tools you need to write your own plugin system or make Python import modules from arbitrary archives.

rlcompleter: getting tab-completion in the normal interactive interpreter. Just do "import readline, rlcompleter; readline.parse_and_bind('tab: complete')"

the PYTHONSTARTUP environment variable: can be set to the path to a file that will be executed (in the main namespace) when entering the interactive interpreter; useful for putting things in like the rlcompleter recipe above.



Delimited string parsing?

I use this to read from a file

string filename = @textBox1.Text;
string[] fields;
string[] delimiter = new string[] {"|"};
using (Microsoft.VisualBasic.FileIO.TextFieldParser parser =
       new Microsoft.VisualBasic.FileIO.TextFieldParser(filename)) {
    parser.Delimiters = delimiter;
    parser.HasFieldsEnclosedInQuotes = false;

    while (!parser.EndOfData) {
        fields = parser.ReadFields();
        //Do what you need
    }
}

I am sure someone here can transform this to parser a string that is in memory.





csv