Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

Related, but no satisfactory answers: How can I split a large text file into chunks of 500 words or so?

I'm trying to take a text file (http://mattmahoney.net/dc/text8.zip) with > 10^7 words all on one line, and split it into lines with N words each. My current approach works, but is fairly slow and ugly (a shell script):

i=0
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
    echo -n "${word} " > output.txt
    let "i=i+1"

    if [ "$i" -eq "1000" ]
    then
        echo > output.txt
        let "i=0"
    fi
done

Any tips on how I can make this faster or more compact?

If you want it faster, you need to use something other than a bash script. I would recommend some C; it can fit in a few lines. –  Jakuje 9 hours ago

8 Answers

Accepted answer (score 3)

Assuming your definition of a word is a sequence of non-blank characters separated by blanks, here's an awk solution for your single-line file:

awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, i % 500 ? " " : "\n"; if (NF % 500) print ""}' file
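The ternary is easier to see at a small scale. Here is the same idea with groups of 3 on a toy input (the word count 3 is just for illustration):

```shell
# The ternary emits a space after each word, except every 3rd word,
# which gets a newline; the trailing "if" terminates a partial last line.
echo 'a b c d e f g' | awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, i % 3 ? " " : "\n"; if (NF % 3) print ""}'
# prints three lines: "a b c", "d e f", "g " (the last keeps its trailing space)
```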
This is exactly what I needed, thanks! –  Cory Schillaci 9 hours ago

Use xargs (17 seconds):

xargs -n1000 <file >output

This uses the -n flag of xargs, which sets the maximum number of arguments per command line. Just change 1000 to 500 or whatever limit you want.
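A quick sanity check at a small scale (nine words, three per line; the numbers are just an illustration):

```shell
# xargs runs its default command (echo) with at most 3 arguments at a time
echo 1 2 3 4 5 6 7 8 9 | xargs -n3
# 1 2 3
# 4 5 6
# 7 8 9
```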

I made a test file with 10^7 words:

$ wc -w file
10000000 file

Here are the time stats:

$ time xargs -n1000 <file >output
real    0m16.677s
user    0m1.084s
sys     0m0.744s
This is slightly slower than the answer I accepted (21s vs 12s on my file) –  Cory Schillaci 9 hours ago
Excellent idea +1, however beware xargs's quote-stripping behaviour –  1_CR 9 hours ago
The lower the n, the slower this will get, just so you know. With -n10 I cancelled it after about 8 minutes of waiting... –  don_crissti 8 hours ago

Perl seems astonishingly good at this:

Create a file with 10,000,000 space-separated words:

for ((i=1; i<=10000000; i++)); do printf "%s " $RANDOM ; done > one.line

Now, use perl to add a newline after every 1,000 words:

time perl -pe '
    s{ 
        (?:\S+\s+){999} \S+   # 1000 words
        \K                    # then reset start of match
        \s+                   # and the next bit of whitespace
    }
    {\n}gx                    # replace whitespace with newline
' one.line > many.line

Timing

real    0m1.074s
user    0m0.996s
sys     0m0.076s

verify results

$ wc one.line many.line
        0  10000000  56608931 one.line
    10000  10000000  56608931 many.line
    10000  20000000 113217862 total

The accepted awk solution took just over 5 sec on my input file.


The venerable fmt(1) command, while not strictly operating on "a particular number of words", can fairly quickly wrap long lines to a particular goal (or maximum) width:

perl -e 'for (1..100) { print "a"x int 3+rand(7), " " }' | fmt

Or with modern perl, for a specific number of words, say, 10, and assuming a single space as the word boundary:

... | perl -ple 's/(.*? ){10}\K/\n/g'

A single sed command can also do this, by specifying how many word-space patterns to match. I didn't have any big files to test it on, but without the loops in your original script this should run as fast as your processor can stream the data. Added benefit: it works equally well on multi-line files.

n=500; sed -r "s/((\w+\s){$n})/\1\n/g" <input.txt >output.txt
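At a small scale with n=3 (GNU sed assumed, since -r, \w, \s, and \n in the replacement are GNU extensions):

```shell
# n word+space pairs are captured and a newline is appended after them;
# note that each emitted line keeps the trailing space that \s matched
n=3; echo 'a b c d e f g' | sed -r "s/((\w+\s){$n})/\1\n/g"
# output: "a b c ", "d e f ", "g"
```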

Not really suitable when the number of words per line is large, but for a small N (and ideally no leading/trailing spaces in your one-line file) this should be quite fast (e.g. 5 words per line):

tr -s '[:blank:]' '\n' <input.txt | paste -d' ' - - - - - >output.txt
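A quick check with 3 columns instead of 5, on a toy input:

```shell
# tr turns each run of blanks into one newline; paste then glues
# 3 consecutive lines back together per output row
echo 'a b c d e f' | tr -s '[:blank:]' '\n' | paste -d' ' - - -
# a b c
# d e f
```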

I wonder how many msec this takes in Go, so I made a quick prototype:

//wordsplit.go

// go build wordsplit.go && ./wordsplit bigtext.txt

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "strings"
)

func main() {
    // os.Args[0] is the binary itself; the input file is os.Args[1]
    myfile, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer myfile.Close()
    data, err := ioutil.ReadAll(myfile)
    if err != nil {
        log.Fatal(err)
    }
    // Fields splits on any run of whitespace, so a trailing newline is harmless
    words := strings.Fields(string(data))
    newfile, err := os.Create("output.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer newfile.Close()
    for i := 0; i < len(words); i += 10 {
        end := i + 10
        if end > len(words) {
            end = len(words)
        }
        newfile.WriteString(strings.Join(words[i:end], " ") + "\n")
    }
    fmt.Printf("Formatted %s into 10 word lines in output.txt\n", os.Args[1])
}

The coreutils pr command is another candidate: the only wrinkle seems to be that it is necessary to force the page width to be large enough to accommodate the output width.

Using a file created using @Glenn_Jackman's 10,000,000 word generator,

$ time tr '[:blank:]' '\n' < one.line | pr -s' ' -W 1000000 -JaT -1000 > many.line

real    0m2.113s
user    0m2.086s
sys     0m0.411s

where the counts are confirmed as follows

$ wc one.line many.line
        0  10000000  56608795 one.line
    10000  10000000  56608795 many.line
    10000  20000000 113217590 total

[Glenn's perl solution is still a little faster, ~1.8s on this machine].
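pr's column behaviour can be checked on a toy input (six lines folded into rows of three):

```shell
# -a fills columns across rather than down, -T drops headers and pagination,
# -s' ' separates columns with a single space, -3 requests three columns
seq 6 | pr -aT -s' ' -3
```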

