Coder Perfect

In Python, how can I retrieve the line count of a huge file quickly?


In Python, I need to count the lines in a huge file (hundreds of thousands of lines). What is the most efficient method in terms of memory and time?

Right now, I’m doing:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is there any way to make it better?

Asked by SilentGhost

Solution #1

One line, and probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
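One caveat with the one-liner: it never explicitly closes the file handle, relying on garbage collection instead. A variant using a context manager (my own sketch, not from the answer) avoids that:

```python
def count_lines(path):
    # Same idea as the one-liner, but the context manager guarantees
    # the file handle is closed as soon as counting finishes.
    with open(path) as f:
        return sum(1 for _ in f)
```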

Answered by Kyle

Solution #2

It doesn’t get much better than this.

After all, any solution must read the entire file, count the number of \n characters, and return that count.

Is there a better way to do this without reading the entire file? I'm not sure… Any ideal solution will always be I/O-bound; the best you can do is make sure you aren't wasting memory, which it looks like you've already done.
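The core observation can be checked directly: for a file that ends in a newline, the line count is exactly the number of \n bytes (a quick illustration, not part of the answer):

```python
# Three lines, each terminated by a newline:
data = b"first\nsecond\nthird\n"
# The newline count equals the line count.
assert data.count(b"\n") == 3
```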

Answered by Yuval Adam

Solution #3

A memory-mapped file, I believe, will be the fastest option. I tested four functions: the OP's function (opcount); simple iteration over the file's lines (simplecount); a readline loop over a memory-mapped file (mapcount); and Mykola Kharechko's buffer-read solution (bufcount).

I calculated the average run-time for a 1.2 million-line text file by running each function five times.

2 GHz AMD processor, Windows XP, Python 2.5, 2GB RAM

Here are my findings:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Updated numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

For Windows/Python 2.6, the buffer read strategy appears to be the fastest.

The code is as follows:

from __future__ import with_statement
import time
import mmap
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read  # loop optimization: avoid repeated attribute lookup

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
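The harness above is Python 2 (print statement, time.time()). On Python 3, a similar harness might use time.perf_counter() for higher-resolution timing; a rough sketch (the benchmark helper and its name are my own, not from the answer):

```python
import time
from collections import defaultdict

def benchmark(filename, funcs, repeats=5):
    """Return each function's average runtime over `repeats` calls."""
    timings = defaultdict(list)
    for _ in range(repeats):
        for func in funcs:
            start = time.perf_counter()
            func(filename)
            timings[func.__name__].append(time.perf_counter() - start)
    return {name: sum(vals) / len(vals) for name, vals in timings.items()}
```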

Answered by Ryan Ginstrom

Solution #4

I couldn't post this as a comment on a similar question until my reputation score improved, so I'm adding it here (thanks to whoever bumped me!).

All of these solutions overlook one way to make this significantly faster: using the unbuffered (raw) interface, bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you'll be reading Unicode text by default.)

I feel the following code is faster (and marginally more pythonic) than any of the solutions presented, using a modified version of the timing tool:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read  # unbuffered raw read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

This runs slightly faster thanks to the use of a separate generator function:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done entirely in-line with itertools and generator expressions, but it looks a little strange:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )
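One caveat that applies to all of the newline-counting variants (my observation, not from the answer): a file whose last line lacks a trailing newline comes out one short compared to enumerate()-style counting:

```python
data = b"alpha\nbeta\ngamma"  # three lines, no trailing newline
# Newline counting sees only two line terminators...
assert data.count(b"\n") == 2
# ...while splitting (like line iteration) yields three lines.
assert len(data.split(b"\n")) == 3
```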

My timings are as follows:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

Answered by Michael Bacon

Solution #5

You could run wc -l filename from a subprocess.

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
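On Python 3.7+, the same idea is a bit shorter with subprocess.run (a sketch assuming the wc utility is available on the system; the function name is my own):

```python
import subprocess

def file_len_run(fname):
    # check=True raises CalledProcessError on a nonzero exit status,
    # replacing the manual returncode test above.
    result = subprocess.run(
        ["wc", "-l", fname],
        capture_output=True, text=True, check=True,
    )
    # wc prints "<count> <filename>"; the first field is the line count.
    return int(result.stdout.split()[0])
```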

Answered by Ólafur Waage
