2

I have a large CSV data file -- ~1,444,000 rows of data -- that I am reading in and converting to a numpy array. I read three of 22 columns. This is what I am currently doing:

import numpy as np
import csv

fid = open('data.csv', 'r')
csvfile = csv.reader(fid, dialect='excel', delimiter=',')
csvfile.next() # to skip header

t = []
u = []
w = []
for line in csvfile:
  t += [line[1]] # time
  u += [line[-4]] # velocity x
  w += [line[-2]] # velocity z
t = np.array(t, dtype='float')  
u = np.array(u, dtype='float')
w = np.array(w, dtype='float')

So my question is: is this efficient? I was originally going to append the new data to an existing numpy array inside the loop, until I read that the whole array has to be copied in memory on each append.

Usagi
  • Not to plug my own answer too much, but have a look at this answer: http://stackoverflow.com/a/8964779/325565 for some memory and execution-time profiling of different ways of reading a large text file into a numpy array. In short, if you're really worried about efficiency, `numpy.fromiter` is often very useful. – Joe Kington Feb 16 '12 at 22:25
  • Your original idea (appending to a numpy array inside the loop) would absolutely have been slower: each append copies the entire array, so building it is quadratic-time overall. Python lists support amortized constant-time appends, so building the whole list is linear-time. – HardlyKnowEm Feb 16 '12 at 22:27
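
As a rough, untested sketch of the `numpy.fromiter` approach from Joe Kington's comment above: the file name and column positions come from the question, while the function name and everything else here are assumptions, not code from the thread.

import numpy as np
import csv

def load_three_columns(path='data.csv'):
    # Stream only the three wanted fields through a generator so that no
    # intermediate Python lists are built before the array is created.
    def fields():
        with open(path, 'r') as fid:
            reader = csv.reader(fid, delimiter=',')
            next(reader)              # skip the header row
            for row in reader:
                yield float(row[1])   # time
                yield float(row[-4])  # velocity x
                yield float(row[-2])  # velocity z
    data = np.fromiter(fields(), dtype=float)  # one flat float array
    return data.reshape(-1, 3).T               # rows -> (t, u, w)

t, u, w = load_three_columns()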

2 Answers

5

I would suggest numpy.loadtxt(). I haven't used it with CSV files myself, but you can set the delimiter to ',' and retrieve just the columns you need as a numpy ndarray.

I suspect the following would work:

import numpy
# Columns 1 (time), 18 (velocity x = line[-4] of 22) and 20 (velocity z = line[-2]); skiprows=1 skips the header.
t, u, w = numpy.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(1, 18, 20), unpack=True)
GerritS
  • This will be much faster, but it will use more memory: 1,444,000 rows * 22 columns * 8 bytes per double-precision float ~= 240 megabytes. In the code given as part of the question, the memory usage is cut by about a factor of three. – HardlyKnowEm Feb 16 '12 at 22:28
  • @mlefavor Good point, I hadn't considered that. However, if I'm reading this correctly, [the source](https://github.com/numpy/numpy/blob/master/numpy/lib/npyio.py#L825) shows that every column of each row is parsed, but specifying `usecols` causes only the desired columns to be added to the returned array. If that's the case then the additional memory is minimal (just 19 columns * 8 bytes) and recycled for each row. – GerritS Feb 17 '12 at 03:35
  • I rescind my objection, then. That's better than anything Python could offer--especially if these are floats, not doubles, since pure Python supports only doubles. – HardlyKnowEm Feb 17 '12 at 18:22
4

There's an easy way to find out which is more efficient--write both implementations (plain lists and numpy) and profile them: http://docs.python.org/library/profile.html.

If you're on a *nix OS, you can also do a simpler measurement: run each version of the script with `time python script.py`.
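
For instance, here is a minimal cProfile sketch; the wrapper function and its loadtxt arguments are only illustrative, not part of either answer.

import cProfile
import numpy as np

def read_with_loadtxt():
    # One candidate implementation; wrap the csv.reader version in a
    # second function and profile it the same way to compare.
    return np.loadtxt('data.csv', delimiter=',', skiprows=1,
                      usecols=(1, 18, 20), unpack=True)

cProfile.run('read_with_loadtxt()', sort='cumulative')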

As a side note, instead of this

t += [line[1]] # time

use this

t.append(line[1]) # time

larsbutler
  • Just to explain why `.append` is better than `+=`: when you do `+=`, that's like saying `t = t + [line[1]]`, so it constructs the intermediate list `t + [line[1]]`, which involves copying all of `t`, and then throws away the original `t`. `.append`, by contrast, only needs to copy the whole list if there's not enough space right after the original `t` to add `line[1]`. – Danica Feb 16 '12 at 22:26
  • Thanks larsbutler for the tip and @Dougal for the explanation. – Usagi Feb 16 '12 at 23:25
  • Actually, that's not true. For lists, `t += [x]` just turns into `t.extend([x])` underneath. However, there's no reason to construct the list `[x]` instead of just using `t.append(x)`. – Robert Kern Feb 23 '12 at 12:24
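
To see Robert Kern's point in action, here is a short demonstration; the variable names are made up for illustration only.

t = [1.0, 2.0]
alias = t
t += [3.0]         # list.__iadd__ extends the existing list in place
print(t is alias)  # True: no new list object was created
t.append(4.0)      # same effect, minus the temporary [4.0] list
print(alias)       # [1.0, 2.0, 3.0, 4.0]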