Non-linear slowdown when creating a lazy seq in Clojure

I implemented a function that returns the n-grams of a given input collection as a lazy seq.

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (cons (take n coll) (gen-ngrams n (rest coll))))))

When I call this function on larger input collections, I would expect the runtime to grow linearly. However, the timings I observe are much worse than that:

user> (time (count (gen-ngrams 3 (take 1000 corpus))))
"Elapsed time: 59.426 msecs"
998
user> (time (count (gen-ngrams 3 (take 10000 corpus))))
"Elapsed time: 5863.971 msecs"
9998
user> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 23584.226 msecs"
19998
user> (time (count (gen-ngrams 3 (take 30000 corpus))))
"Elapsed time: 54905.999 msecs"
29998
user> (time (count (gen-ngrams 3 (take 40000 corpus))))
"Elapsed time: 100978.962 msecs"
39998

corpus is a Cons of string tokens.

What causes this behavior, and how can I improve the performance?

+3
2 answers

I think your problem is with "(count coll)", which iterates over the whole coll on every call to gen-ngrams.

The solution is to use the built-in partition function:

user=> (time (count (gen-ngrams 3 (take 20000 corpus))))
"Elapsed time: 6212.894932 msecs"
19998
user=> (time (count (partition 3 1 (take 20000 corpus))))
"Elapsed time: 12.57996 msecs"
19998

For further reading, see http://clojuredocs.org/clojure_core/clojure.core/partition
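
If you want to keep a hand-written version, here is a minimal sketch of the same idea without the full count. The name gen-ngrams-bounded is made up for illustration, and it assumes n is a positive integer. Instead of counting the whole remaining collection at every step, it only counts the n-gram it has just taken, so each step looks at no more than n elements.

```clojure
;; Sketch only: gen-ngrams-bounded is a hypothetical name, n is assumed > 0.
(defn gen-ngrams-bounded
  [n coll]
  (lazy-seq
    (let [gram (take n coll)]
      ;; Counting `gram` walks at most n elements, so the cost per step
      ;; depends on n and not on how much of coll is left.
      (when (= n (count gram))
        (cons gram (gen-ngrams-bounded n (rest coll)))))))

(gen-ngrams-bounded 3 [:a :b :c :d :e])
;; => ((:a :b :c) (:b :c :d) (:c :d :e))

;; It should agree with the built-in on the same input.
(= (gen-ngrams-bounded 3 (range 1000))
   (partition 3 1 (range 1000)))
;; => true
```

In practice partition with a step of 1 is still the simpler choice. The sketch is only meant to show that the slowdown comes from the repeated full count and not from lazy-seq or cons.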

+5

I am far from a Clojure expert, but I think the cons function causes this problem. Try using list instead:

(defn gen-ngrams
  [n coll]
  (if (>= (count coll) n)
    (lazy-seq (list (take n coll) (gen-ngrams n (rest coll))))))

I think cons constructs a new seq, which is more generic than a list, and is therefore slower.

Edit: if "corpus is a Cons of string tokens", ...

0
