Coder Perfect

In Python, how can I retrieve the line count of a huge file quickly?


In Python, I need to count the lines in a huge file (hundreds of thousands of lines). What is the most efficient method in terms of memory and time?

Right now, I’m doing:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is there any way to make it better?

Asked by SilentGhost

Solution #1

One line, and probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
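One caveat with the one-liner: it never explicitly closes the file handle, relying on garbage collection instead. A variant using a context manager (my own sketch, not from the answer) avoids that:

```python
def count_lines(path):
    # Same idea as the one-liner, but the context manager guarantees
    # the file handle is closed as soon as counting finishes.
    with open(path) as f:
        return sum(1 for _ in f)
```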

Answered by Kyle

Solution #2

It doesn’t get much better than this.

After all, any solution must read the entire file, count the number of \n characters, and return that count.

Is there a better way to do this without reading the entire file? I'm not sure… Any ideal solution will always be I/O-bound; the best you can do is make sure you aren't wasting memory, which it looks like you've already done.
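The core observation can be checked directly: for a file that ends in a newline, the line count is exactly the number of \n bytes (a quick illustration, not part of the answer):

```python
# Three lines, each terminated by a newline:
data = b"first\nsecond\nthird\n"
# The newline count equals the line count.
assert data.count(b"\n") == 3
```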

Answered by Yuval Adam

Solution #3

A memory-mapped file, I believe, will be the fastest option. I tested four functions: the OP's function (opcount); simple iteration over the file's lines (simplecount); a readline loop over a memory-mapped file (mapcount); and Mykola Kharechko's buffer-read solution (bufcount).

I calculated the average run-time for a 1.2 million-line text file by running each function five times.

2 GHz AMD processor, Windows XP, Python 2.5, 2GB RAM

Here are my findings:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Updated numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

For Windows/Python 2.6, the buffer read strategy appears to be the fastest.

The code is as follows:

from __future__ import with_statement
import time
import mmap
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read  # loop optimization: avoid repeated attribute lookup

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
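The harness above is Python 2 (print statement, time.time()). On Python 3, a similar harness might use time.perf_counter() for higher-resolution timing; a rough sketch (the benchmark helper and its name are my own, not from the answer):

```python
import time
from collections import defaultdict

def benchmark(filename, funcs, repeats=5):
    """Return each function's average runtime over `repeats` calls."""
    timings = defaultdict(list)
    for _ in range(repeats):
        for func in funcs:
            start = time.perf_counter()
            func(filename)
            timings[func.__name__].append(time.perf_counter() - start)
    return {name: sum(vals) / len(vals) for name, vals in timings.items()}
```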

Answered by Ryan Ginstrom

Solution #4

I couldn't post this as a comment on a similar question until my reputation score improved, so I'm adding it here (thanks to whoever bumped me!).

All of these solutions overlook one way to make this significantly faster: using the unbuffered (raw) interface, bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you'll be reading Unicode text by default.)

I feel the following code is faster (and marginally more pythonic) than any of the solutions presented, using a modified version of the timing tool:

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read  # unbuffered raw read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

This runs slightly faster thanks to the use of a separate generator function:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

This can be done entirely in-line with itertools and generator expressions, but it looks a little strange:

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024 * 1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )
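One caveat that applies to all of the newline-counting variants (my observation, not from the answer): a file whose last line lacks a trailing newline comes out one short compared to enumerate()-style counting:

```python
data = b"alpha\nbeta\ngamma"  # three lines, no trailing newline
# Newline counting sees only two line terminators...
assert data.count(b"\n") == 2
# ...while splitting (like line iteration) yields three lines.
assert len(data.split(b"\n")) == 3
```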

My timings are as follows:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

Answered by Michael Bacon

Solution #5

You could run wc -l filename from a subprocess.

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
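On Python 3.7+, the same idea is a bit shorter with subprocess.run (a sketch assuming the wc utility is available on the system; the function name is my own):

```python
import subprocess

def file_len_run(fname):
    # check=True raises CalledProcessError on a nonzero exit status,
    # replacing the manual returncode test above.
    result = subprocess.run(
        ["wc", "-l", fname],
        capture_output=True, text=True, check=True,
    )
    # wc prints "<count> <filename>"; the first field is the line count.
    return int(result.stdout.split()[0])
```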

Answered by Ólafur Waage
