In my previous post I provided a solution for creating plausible random strings of text using Clojure. But our random sequence generator needs to be trained on some data, and we don’t yet have any. This short post covers collecting some test data from the reddit social media website.

Why Choose Reddit?

Reddit has a very accessible API which gives us access to pretty much everything on reddit. Reddit also has a vast amount of content posted on it each day, making it perfect for collecting data.

The Data

The data I will be collecting from reddit is comments. These can easily be accessed using URLs of the following form:

http://www.reddit.com/r/{subreddit}/comments.json

Where {subreddit} is replaced by the name of the subreddit we are retrieving comments for.
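To make this concrete, a small helper can build such a URL from a subreddit name (this helper is mine, not part of the code below; "clojure" is just an example subreddit):

```clojure
;; Build the comments URL for a given subreddit (illustrative helper).
(defn comments-url [subreddit]
  (str "http://www.reddit.com/r/" subreddit "/comments.json"))

(comments-url "clojure")
;; => "http://www.reddit.com/r/clojure/comments.json"
```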

Retrieving the Data

Retrieving the data is done with the excellent clj-http library. Accordingly, all functions defined below live in the following namespace (unless specified otherwise):

(ns markovtext.reddit
  (:gen-class)
  (:require
   [clj-http.client :as client]
   [clojure.string :as string])
  (:import [java.net URLEncoder]))

Using clj-http I have written the following function, which creates an HTTP GET request and returns a Clojure data structure parsed from the JSON in the response.

(defn urlopen [url data cookie]
  (let [response (client/get (string/join [url "?" (format-data data)])
                             {:headers {"User-Agent" "reddit.clj"}
                              :cookies cookie
                              :as :json             ; parse the JSON body into Clojure data
                              :socket-timeout 10000 ; milliseconds
                              :conn-timeout 10000})]
    ;; Only return the parsed body on success
    (when (= 200 (:status response))
      (:body response))))

The format-data helper function is defined as follows:

(defn format-data [data]
  (->> data
       ;; Convert keys and values to a URL-safe form
       (map (fn [[k v]] [(name k) (URLEncoder/encode (str v))]))
       ;; Join each key-value pair with "="
       (map #(string/join "=" %))
       ;; Join the pairs with "&"
       (string/join "&")))

We are now ready to request data from reddit. The following function fetches a single listing from a URL (it will probably only return something useful for a reddit URL):

(defn get-listing [url after]
  (->> (urlopen url (if (nil? after) {:limit 1000} {:limit 1000 :after after}) nil)
       ;; the interesting bits are in the data field
       (:data)))
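For context, the parsed response (before taking :data) looks roughly like the sketch below. The field names follow reddit's listing format, but the values here are invented placeholders:

```clojure
;; Sketch of a parsed listing response (placeholder values, not real data).
(def sample-listing
  {:kind "Listing"
   :data {:children [{:kind "t1"
                      :data {:body "an example comment" :author "someone"}}]
          :after "t1_c0ffee" ; pagination cursor, passed back as :after
          :before nil}})

;; get-listing keeps only the :data part:
(:data sample-listing)
```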

Because reddit limits the number of items returned by a single request, gathering a large number of comments requires making multiple requests. However, reddit also rate-limits clients to 30 requests per minute. I’ve written a function (shown below) that makes an unbounded number of requests to reddit without breaking the API rules.

(defn get-listings
  ([url after]
   ;; Sleep between requests to stay safely under reddit's rate limit
   (Thread/sleep 2000)
   (let [listing (get-listing url after)]
     ;; Lazily concatenate this page's items with the pages that follow
     (lazy-cat (:children listing) (get-listings url (:after listing)))))
  ([url]
   (let [listing (get-listing url nil)]
     (lazy-cat (:children listing) (get-listings url (:after listing))))))

This function uses lazy-cat from clojure.core to create a lazy sequence of reddit listings. Note that I use the rather hacky Thread/sleep to ensure compliance with the API rules and avoid being rate-limited. A constant two-second sleep means requests are made at a slightly slower rate than reddit allows. Although not perfect, this solution will suffice for now.

Now that we can get listings, we can start retrieving comments. Below, the reddit-corpus function is defined along with a couple of helpers. reddit-corpus returns a lazy sequence of comment bodies; that is, it strips everything from each listing except the body of each comment.

(defn process-reddit [st]
  (-> st
      ;; Strip parenthesised link targets, e.g. "(http://example.com)"
      (string/replace #"\((http://)?[\w\.\?&]+\)" "")
      ;; Strip leftover markdown punctuation
      (string/replace #"[\*\(\)\[\]]" "")
      ;; Strip numbered-list markers such as "1)"
      (string/replace #"\d\)" "")))
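A quick sanity check of the cleanup on a markdown-style comment (self-contained copy of process-reddit; the sample string is made up):

```clojure
(require '[clojure.string :as string])

;; Same cleanup as the process-reddit function above.
(defn process-reddit [st]
  (-> st
      (string/replace #"\((http://)?[\w\.\?&]+\)" "")
      (string/replace #"[\*\(\)\[\]]" "")
      (string/replace #"\d\)" "")))

(process-reddit "check [this](http://example.com) out")
;; => "check this out"
```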

(def redditurl "http://www.reddit.com")

(defn sr-comments [sr]
  (map
   :data
   (get-listings (string/join [redditurl "/r/" sr "/comments.json"]))))

(defn reddit-corpus [sr]
  (map
   (comp process-reddit :body)
   (sr-comments sr)))

Generating Random Sequences from this Data

For this section, assume that all of the above functions are available under the reddit namespace alias. Also assume that the TransitionTable protocol and its extension over Clojure’s maps (defined in the previous blog post) are available under the tt namespace alias.

Using the functions defined in the previous blog post one can open up the REPL and run the following code to generate a random sequence of words based on a transition table created from reddit comments.

> (def corpus (reddit/reddit-corpus "technology"))
> (take 20 (corpus-likely-seq (take 1000 corpus) 1))

This generates a sequence of twenty words using a Markov chain trained on 1000 reddit comments from the technology subreddit. Running the above code produced the following sequence:

("alterior" "motive" "," "and" "potential" 
"solution" "."  "npd" "," "they're" "trying" 
"to" "be" "considered" "non-trivial" "since" "your" "mother" "," "so")

We can then turn this sequence into a single string using the following function which ensures that punctuation is properly spaced:

(defn stringify-seq [seq]
  (->>
   seq
   ;; Capitalise the personal pronoun
   (replace {"i" "I"})
   (clojure.string/join " ")
   ;; Remove the space inserted before punctuation such as ".", "?" and ","
   (#(clojure.string/replace % #" +[\.\?,]" (fn [s] (str (second s)))))))
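Running stringify-seq on a small hand-made token sequence shows the punctuation fix-up in action (self-contained copy of the function above, with clojure.string aliased):

```clojure
(require '[clojure.string :as string])

;; Same logic as the stringify-seq function above.
(defn stringify-seq [seq]
  (->> seq
       (replace {"i" "I"})
       (string/join " ")
       (#(string/replace % #" +[\.\?,]" (fn [s] (str (second s)))))))

(stringify-seq ["i" "like" "it" "," "really" "."])
;; => "I like it, really."
```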

Running this function on the returned sequence gives something like the following:

“alterior motive, and potential solution. npd, they’re trying to be considered non-trivial since your mother, so”

As you can see, the quote doesn’t make much sense, although it does roll off the tongue reasonably well. To generate more realistic sentences we’ll probably have to use a higher chain order. For that we’ll need a much larger transition table (its size grows roughly as the number of distinct words raised to the power of the order).

In the next blog post I will look at continually updating the transition table and storing it in a redis key-value store, so that we can better scale things up.