loop - r list()

How to Correctly Use Lists in R? (8)

Brief background: Many (most?) contemporary programming languages in widespread use have at least a handful of ADTs [abstract data types] in common, in particular,

  • string (a sequence comprised of characters)

  • list (an ordered collection of values), and

  • map-based type (an unordered array that maps keys to values)

In the R programming language, the first two are implemented as character and vector, respectively.

When I began learning R, two things were obvious almost from the start: list is the most important data type in R (because it is the parent class for the R data.frame), and second, I just couldn't understand how they worked, at least not well enough to use them correctly in my code.

For one thing, it seemed to me that R's list data type was a straightforward implementation of the map ADT (dictionary in Python, NSMutableDictionary in Objective C, hash in Perl and Ruby, object literal in Javascript, and so forth).

For instance, you create them just like you would a Python dictionary, by passing key-value pairs to a constructor (which in Python is dict not list):

x = list("ev1"=10, "ev2"=15, "rv"="Group 1")

And you access the items of an R List just like you would those of a Python dictionary, e.g., x['ev1']. Likewise, you can retrieve just the 'keys' or just the 'values' by:

names(x)    # fetch just the 'keys' of an R list
# [1] "ev1" "ev2" "rv"

unlist(x)   # fetch just the 'values' of an R list
#   ev1       ev2        rv 
#  "10"      "15" "Group 1" 

x = list("a"=6, "b"=9, "c"=3)  

# [1] 18

but R lists are also unlike other map-type ADTs (from among the languages I've learned anyway). My guess is that this is a consequence of the initial spec for S, i.e., an intention to design a data/statistics DSL [domain-specific language] from the ground-up.

three significant differences between R lists and mapping types in other languages in widespread use (e.g,. Python, Perl, JavaScript):

first, lists in R are an ordered collection, just like vectors, even though the values are keyed (ie, the keys can be any hashable value not just sequential integers). Nearly always, the mapping data type in other languages is unordered.

second, lists can be returned from functions even though you never passed in a list when you called the function, and even though the function that returned the list doesn't contain an (explicit) list constructor (Of course, you can deal with this in practice by wrapping the returned result in a call to unlist):

x = strsplit(LETTERS[1:10], "")     # passing in an object of type 'character'

class(x)                            # returns 'list', not a vector of length 2
# [1] list

A third peculiar feature of R's lists: it doesn't seem that they can be members of another ADT, and if you try to do that then the primary container is coerced to a list. E.g.,

x = c(0.5, 0.8, 0.23, list(0.5, 0.2, 0.9), recursive=TRUE)

# [1] list

my intention here is not to criticize the language or how it is documented; likewise, I'm not suggesting there is anything wrong with the list data structure or how it behaves. All I'm after is to correct is my understanding of how they work so I can correctly use them in my code.

Here are the sorts of things I'd like to better understand:

  • What are the rules which determine when a function call will return a list (e.g., strsplit expression recited above)?

  • If I don't explicitly assign names to a list (e.g., list(10,20,30,40)) are the default names just sequential integers beginning with 1? (I assume, but I am far from certain that the answer is yes, otherwise we wouldn't be able to coerce this type of list to a vector w/ a call to unlist.)

  • Why do these two different operators, [], and [[]], return the same result?

    x = list(1, 2, 3, 4)

    both expressions return "1":



  • why do these two expressions not return the same result?

    x = list(1, 2, 3, 4)

    x2 = list(1:4)

Please don't point me to the R Documentation (?list, R-intro)--I have read it carefully and it does not help me answer the type of questions I recited just above.

(lastly, I recently learned of and began using an R Package (available on CRAN) called hash which implements conventional map-type behavior via an S4 class; I can certainly recommend this Package.)

Alhough this is pretty old question I must say it is touching exactly the knowledge I was missing during my first steps in R - i.e. how to express data in my hand as object in R or how to select from existing objects. It is not easy for a R novice to think "in a R box" from the very beginning.

So I myself started to use crutches below which helped me a lot to find out what object to use for what data, and basically to imagine real world usage.

Though I not giving exact answers to the question the short text below might help reader who just started with R and is asking simmilar questions.

  • Atomic vector ... I called that "sequence" for myself, no direction, just sequence of same types. [ subsets.
  • Vector ... sequence with one direction from 2D, [ subsets.
  • Matrix ... bunch of vectors with same length forming rows or columns, [ subsets by rows and columns, or by sequence.
  • Arrays ... layered matrices forming 3D
  • Dataframe ... a 2D table like in excel, where I can sort, add or remove rows or columns or make arit. operations with them, only after some time I truly recognized that dataframe is an clever implementation of list where I can subset using [ by rows and columns, but even using [[.
  • List ... to help myself I thought about list as of tree structure where [i] selects and returns whole branches and [[i]] returns item from the branch. And because it it is tree like structure, you can even use an index sequence to address every single leaf on a very complex list using its [[index_vector]]. Lists can be simple or very complex and can mix together various types of objects into one.

So for lists you can end up with more ways how to select a leaf depending on situation like in following example.

l <- list("aaa",5,list(1:3),LETTERS[1:4],matrix(1:9,3,3))
l[[c(5,4)]] # selects 4 from matrix using [[index_vector]] in list
l[[5]][4] # selects 4 from matrix using sequential index in matrix
l[[5]][1,2] # selects 4 from matrix using row and column in matrix

This way of thinking helped me a lot.

If it helps, I tend to conceive "lists" in R as "records" in other pre-OO languages:

  • they do not make any assumptions about an overarching type (or rather the type of all possible records of any arity and field names is available).
  • their fields can be anonymous (then you access them by strict definition order).

The name "record" would clash with the standard meaning of "records" (aka rows) in database parlance, and may be this is why their name suggested itself: as lists (of fields).

Just to address the last part of your question, since that really points out the difference between a list and vector in R:

Why do these two expressions not return the same result?

x = list(1, 2, 3, 4); x2 = list(1:4)

A list can contain any other class as each element. So you can have a list where the first element is a character vector, the second is a data frame, etc. In this case, you have created two different lists. x has four vectors, each of length 1. x2 has 1 vector of length 4:

> length(x[[1]])
[1] 1
> length(x2[[1]])
[1] 4

So these are completely different lists.

R lists are very much like a hash map data structure in that each index value can be associated with any object. Here's a simple example of a list that contains 3 different classes (including a function):

> complicated.list <- list("a"=1:4, "b"=1:3, "c"=matrix(1:4, nrow=2), "d"=search)
> lapply(complicated.list, class)
[1] "integer"
[1] "integer"
[1] "matrix"
[1] "function"

Given that the last element is the search function, I can call it like so:

> complicated.list[["d"]]()
[1] ".GlobalEnv" ...

As a final comment on this: it should be noted that a data.frame is really a list (from the data.frame documentation):

A data frame is a list of variables of the same number of rows with unique row names, given class ‘"data.frame"’

That's why columns in a data.frame can have different data types, while columns in a matrix cannot. As an example, here I try to create a matrix with numbers and characters:

> a <- 1:4
> class(a)
[1] "integer"
> b <- c("a","b","c","d")
> d <- cbind(a, b)
> d
 a   b  
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
[4,] "4" "d"
> class(d[,1])
[1] "character"

Note how I cannot change the data type in the first column to numeric because the second column has characters:

> d[,1] <- as.numeric(d[,1])
> class(d[,1])
[1] "character"

Just to take a subset of your questions:

This article on indexing addresses the question of the difference between [] and [[]].

In short [[]] selects a single item from a list and [] returns a list of the selected items. In your example, x = list(1, 2, 3, 4)' item 1 is a single integer but x[[1]] returns a single 1 and x[1] returns a list with only one value.

> x = list(1, 2, 3, 4)
> x[1]
[1] 1

> x[[1]]
[1] 1

Regarding vectors and the hash/array concept from other languages:

  1. Vectors are the atoms of R. Eg, rpois(1e4,5) (5 random numbers), numeric(55) (length-55 zero vector over doubles), and character(12) (12 empty strings), are all "basic".

  2. Either lists or vectors can have names.

    > n = numeric(10)
    > n
     [1] 0 0 0 0 0 0 0 0 0 0
    > names(n)
    > names(n) = LETTERS[1:10]
    > n
    A B C D E F G H I J 
    0 0 0 0 0 0 0 0 0 0
  3. Vectors require everything to be the same data type. Watch this:

    > i = integer(5)
    > v = c(n,i)
    > v
    A B C D E F G H I J           
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
    > class(v)
    [1] "numeric"
    > i = complex(5)
    > v = c(n,i)
    > class(v)
    [1] "complex"
    > v
       A    B    C    D    E    F    G    H    I    J                          
    0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i 0+0i
  4. Lists can contain varying data types, as seen in other answers and the OP's question itself.

I've seen languages (ruby, javascript) in which "arrays" may contain variable datatypes, but for example in C++ "arrays" must be all the same datatype. I believe this is a speed/efficiency thing: if you have a numeric(1e6) you know its size and the location of every element a priori; if the thing might contain "Flying Purple People Eaters" in some unknown slice, then you have to actually parse stuff to know basic facts about it.

Certain standard R operations also make more sense when the type is guaranteed. For example cumsum(1:9) makes sense whereas cumsum(list(1,2,3,4,5,'a',6,7,8,9)) does not, without the type being guaranteed to be double.

As to your second question:

Lists can be returned from functions even though you never passed in a List when you called the function

Functions return different data types than they're input all the time. plot returns a plot even though it doesn't take a plot as an input. Arg returns a numeric even though it accepted a complex. Etc.

(And as for strsplit: the source code is here.)

Regarding your questions, let me address them in order and give some examples:

1) A list is returned if and when the return statement adds one. Consider

 R> retList <- function() return(list(1,2,3,4)); class(retList())
 [1] "list"
 R> notList <- function() return(c(1,2,3,4)); class(notList())
 [1] "numeric"

2) Names are simply not set:

R> retList <- function() return(list(1,2,3,4)); names(retList())

3) They do not return the same thing. Your example gives

R> x <- list(1,2,3,4)
R> x[1]
[1] 1
R> x[[1]]
[1] 1

where x[1] returns the first element of x -- which is the same as x. Every scalar is a vector of length one. On the other hand x[[1]] returns the first element of the list.

4) Lastly, the two are different between they create, respectively, a list containing four scalars and a list with a single element (that happens to be a vector of four elements).

why do these two different operators, [ ], and [[ ]], return the same result?

x = list(1, 2, 3, 4)
  1. [ ] provides sub setting operation. In general sub set of any object will have the same type as the original object. Therefore, x[1] provides a list. Similarly x[1:2] is a subset of original list, therefore it is a list. Ex.

    [[1]] [1] 1
    [[2]] [1] 2
  2. [[ ]] is for extracting an element from the list. x[[1]] is valid and extract the first element from the list. x[[1:2]] is not valid as [[ ]] does not provide sub setting like [ ].

     x[[2]] [1] 2 
    > x[[2:3]] Error in x[[2:3]] : subscript out of bounds

x = list(1, 2, 3, 4)
x2 = list(1:4)

is not the same because 1:4 is the same as c(1,2,3,4). If you want them to be the same then:

x = list(c(1,2,3,4))
x2 = list(1:4)