Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

Related, but no satisfactory answers: How can I split a large text file into chunks of 500 words or so?

I'm trying to take a text file (http://mattmahoney.net/dc/text8.zip) with > 10^7 words all on one line, and split it into lines with N words each. My current approach works, but is fairly slow and ugly (a shell script):

i=0
for word in $(sed -e 's/\s\+/\n/g' input.txt)
do
    echo -n "${word} " > output.txt
    let "i=i+1"

    if [ "$i" -eq "1000" ]
    then
        echo > output.txt
        let "i=0"
    fi
done

Any tips on how I can make this faster or more compact?

If you want it faster, you need to use something other than a bash script. I would recommend some C; it can fit in a few lines. –  Jakuje 9 hours ago

8 Answers

Accepted answer (score 3)

Assuming your definition of a word is a sequence of non-blank characters separated by blanks, here's an awk solution for your single-line file:

awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, i % 500 ? " " : "\n"; if (NF % 500) print ""}' file
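The ternary is easier to see at a small scale. Here is the same idea with groups of 3 on a toy input (the word count 3 is just for illustration):

```shell
# The ternary emits a space after each word, except every 3rd word,
# which gets a newline; the trailing "if" terminates a partial last line.
echo 'a b c d e f g' | awk '{for (i=1; i<=NF; ++i) printf "%s%s", $i, i % 3 ? " " : "\n"; if (NF % 3) print ""}'
# prints three lines: "a b c", "d e f", "g " (the last keeps its trailing space)
```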
This is exactly what I needed, thanks! –  Cory Schillaci 9 hours ago

Use xargs (17 seconds):

xargs -n1000 <file >output

This uses the -n flag of xargs, which sets the maximum number of arguments per command line. Just change 1000 to 500 or whatever limit you want.
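A quick sanity check at a small scale (nine words, three per line; the numbers are just an illustration):

```shell
# xargs runs its default command (echo) with at most 3 arguments at a time
echo 1 2 3 4 5 6 7 8 9 | xargs -n3
# 1 2 3
# 4 5 6
# 7 8 9
```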

I made a test file with 10^7 words:

$ wc -w file
10000000 file

Here are the time stats:

$ time xargs -n1000 <file >output
real    0m16.677s
user    0m1.084s
sys     0m0.744s
This is slightly slower than the answer I accepted (21s vs 12s on my file) –  Cory Schillaci 9 hours ago
Excellent idea +1, however beware xargs's quote-stripping behaviour –  1_CR 9 hours ago
The lower the n, the slower this will get, just so you know. With -n10 I cancelled it after about 8 minutes of waiting... –  don_crissti 8 hours ago

Perl seems astonishingly good at this:

Create a file with 10,000,000 space-separated words:

for ((i=1; i<=10000000; i++)); do printf "%s " $RANDOM ; done > one.line

Now, use perl to add a newline after every 1,000 words:

time perl -pe '
    s{ 
        (?:\S+\s+){999} \S+   # 1000 words
        \K                    # then reset start of match
        \s+                   # and the next bit of whitespace
    }
    {\n}gx                    # replace whitespace with newline
' one.line > many.line

Timing

real    0m1.074s
user    0m0.996s
sys     0m0.076s

verify results

$ wc one.line many.line
        0  10000000  56608931 one.line
    10000  10000000  56608931 many.line
    10000  20000000 113217862 total

The accepted awk solution took just over 5 sec on my input file.


The venerable fmt(1) command, while not strictly operating on "a particular number of words", can fairly quickly wrap long lines to a particular goal (or maximum) width:

perl -e 'for (1..100) { print "a"x int 3+rand(7), " " }' | fmt

Or with modern perl, for a specific number of words, say, 10, and assuming a single space as the word boundary:

... | perl -ple 's/(.*? ){10}\K/\n/g'

A single sed command can also do this, by specifying how many word-space patterns to match. I didn't have any big files to test it on, but without the loops in your original script this should run as fast as your processor can stream the data. Added benefit: it works equally well on multi-line files.

n=500; sed -r "s/((\w+\s){$n})/\1\n/g" <input.txt >output.txt
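At a small scale with n=3 (GNU sed assumed, since -r, \w, \s, and \n in the replacement are GNU extensions):

```shell
# n word+space pairs are captured and a newline is appended after them;
# note that each emitted line keeps the trailing space that \s matched
n=3; echo 'a b c d e f g' | sed -r "s/((\w+\s){$n})/\1\n/g"
# output: "a b c ", "d e f ", "g"
```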

Not really suitable when the number of words per line is large, but for a small N (and ideally no leading/trailing spaces in your one-line file) this should be quite fast (e.g. 5 words per line):

tr -s '[:blank:]' '\n' <input.txt | paste -d' ' - - - - - >output.txt
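A quick check with 3 columns instead of 5, on a toy input:

```shell
# tr turns each run of blanks into one newline; paste then glues
# 3 consecutive lines back together per output row
echo 'a b c d e f' | tr -s '[:blank:]' '\n' | paste -d' ' - - -
# a b c
# d e f
```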

I wonder how many msec this takes in Go, so I made a quick prototype:

//wordsplit.go

// go build wordsplit.go && ./wordsplit bigtext.txt

package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "os"
    "strings"
)

func main() {
    // os.Args[0] is the binary itself; the input file is os.Args[1]
    myfile, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer myfile.Close()
    data, err := ioutil.ReadAll(myfile)
    if err != nil {
        log.Fatal(err)
    }
    // Fields splits on any run of whitespace, so a trailing newline is harmless
    words := strings.Fields(string(data))
    newfile, err := os.Create("output.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer newfile.Close()
    for i := 0; i < len(words); i += 10 {
        end := i + 10
        if end > len(words) {
            end = len(words)
        }
        newfile.WriteString(strings.Join(words[i:end], " ") + "\n")
    }
    fmt.Printf("Formatted %s into 10 word lines in output.txt\n", os.Args[1])
}

The coreutils pr command is another candidate: the only wrinkle seems to be that it is necessary to force the page width to be large enough to accommodate the output width.

Using a file created using @Glenn_Jackman's 10,000,000 word generator,

$ time tr '[:blank:]' '\n' < one.line | pr -s' ' -W 1000000 -JaT -1000 > many.line

real    0m2.113s
user    0m2.086s
sys     0m0.411s

where the counts are confirmed as follows

$ wc one.line many.line
        0  10000000  56608795 one.line
    10000  10000000  56608795 many.line
    10000  20000000 113217590 total

[Glenn's perl solution is still a little faster, ~1.8s on this machine].
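pr's column behaviour can be checked on a toy input (six lines folded into rows of three):

```shell
# -a fills columns across rather than down, -T drops headers and pagination,
# -s' ' separates columns with a single space, -3 requests three columns
seq 6 | pr -aT -s' ' -3
```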

