Yajl parsing error with githubarchive.org JSON stream in Python

I am trying to parse a GitHub archive file using yajl-py. I believe the file is a stream of concatenated JSON objects, so the file as a whole is not valid JSON, although each object in it is.

To verify this, I installed yajl-py and then used their example parser (from https://github.com/pykler/yajl-py/blob/master/examples/yajl_py_example.py ) to try to parse the file:

python yajl_py_example.py < 2012-03-12-0.json

where 2012-03-12-0.json is one of the GitHub archive files, already unpacked.

Judging from their reference implementation in Ruby, this should work. Do the Python packages not handle JSON streams?

By the way, here is the error I get:

yajl.yajl_common.YajlError: parse error: trailing garbage
          9478bbc3","type":"PushEvent"}{"repository":{"url":"https://g
                     (right here) ------^
+5

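You need a stream parser to read this data. Yajl supports stream parsing, which lets you read one object at a time from a file or stream. That said, it does not look like Python has working bindings for Yajl.

py-yajl has iterload commented out, and I am not sure why: https://github.com/rtyler/py-yajl/commit/a618f66005e9798af848c15d9aa35c60331e6687#L1R264

This is not a Python solution, but you can use the Ruby bindings to read the data and emit it in whatever format you need: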

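# gem install yajl-ruby

require 'open-uri'
require 'zlib'
require 'yajl'

# URI.open is needed on Ruby 3.0+, where a bare open() no longer fetches URLs
gz = URI.open('http://data.githubarchive.org/2012-03-11-12.json.gz')
js = Zlib::GzipReader.new(gz).read

# the block is called once for each JSON object in the stream
Yajl::Parser.parse(js) do |event|
  print event
end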
+4

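The example does not enable any of Yajl's extra features. For what you want, you need to enable the allow_multiple_values flag on your parser. Here is what to change in the basic example so that it parses your file: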

--- a/examples/yajl_py_example.py
+++ b/examples/yajl_py_example.py
@@ -37,6 +37,7 @@ class ContentHandler(YajlContentHandler):

 def main(args):
     parser = YajlParser(ContentHandler())
+    parser.allow_multiple_values = True
     if args:
         for fn in args:
             f = open(fn)

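Yajl-Py is a thin wrapper around yajl, so you can use all the features that Yajl provides. These are the flags yajl offers that you can enable: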

yajl_allow_comments
yajl_dont_validate_strings
yajl_allow_trailing_garbage
yajl_allow_multiple_values
yajl_allow_partial_values

To turn these on in yajl-py, do the following:

parser = YajlParser(ContentHandler())
# enabling these features, note that to make it more pythonic, the prefix `yajl_` was removed
parser.allow_comments = True
parser.dont_validate_strings = True
parser.allow_trailing_garbage = True
parser.allow_multiple_values = True
parser.allow_partial_values = True
# then go ahead and parse
parser.parse()
+1

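I know this has already been answered, but I prefer the following approach, which does not use any packages. For some reason the GitHub dictionaries are all on a single line, so you cannot assume one dictionary per line. The data looks like this: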

{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}

I decided to write a function that fetches one dictionary at a time. It returns the JSON as a string.

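def read_next_dictionary(f):
    """Read one top-level JSON object from text file f; return '' at EOF."""
    depth = 0
    in_string = False  # braces inside a JSON string must not be counted
    escaped = False    # previous character was a backslash inside a string
    chars = []
    while True:
        c = f.read(1)
        if not c:
            break  # EOF
        if depth == 0 and c != '{':
            continue  # skip whitespace/newlines between objects
        chars.append(c)
        if in_string:
            if escaped:
                escaped = False
            elif c == '\\':
                escaped = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == '{':
            depth += 1
        elif c == '}':
            depth -= 1
            if depth == 0:
                break  # closed the outermost object

    return ''.join(chars)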

Usage, parsing the GitHub timeline with a while loop:

import json
import pprint

file_path = '2012-03-12-0.json'  # an unpacked GitHub archive file

arr_of_dicts = []
with open(file_path) as f:
    while True:
        json_as_str = read_next_dictionary(f)
        try:
            arr_of_dicts.append(json.loads(json_as_str))
        except ValueError:
            break  # nothing parseable left, so the end of the file was reached

pprint.pprint(arr_of_dicts)

You can get the files from: http://www.githubarchive.org/ (after gunzip)
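
For comparison, the same one-object-at-a-time idea can be sketched with only the standard library, using `json.JSONDecoder.raw_decode`, which parses a single value from a string and returns the index where it stopped. The helper name `iter_json_objects` is made up for this illustration, and the sketch assumes the whole file fits in memory as one string.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode rejects leading whitespace, so skip over it first
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Two objects back to back, as in the sample line shown earlier
sample = ('{"json-key":"json-val", "sub-dict":{"sub-key":"sub-val"}}'
          '{"json-key2":"json-val2", "sub-dict2":{"sub-key2":"sub-val2"}}')
events = list(iter_json_objects(sample))
print(len(events))
```

Because the decoder itself finds where each object ends, braces inside string values need no special handling here.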

+1

As a workaround, you can split the GitHub archive file into lines and then parse each line as JSON:

import json

with open('2013-05-31-10.json') as f:
    for line in f:  # iterate lazily instead of reading the whole file at once
        rec = json.loads(line)
        ...
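
If the archive is still gzip-compressed, a variation on the same idea is to read it with the standard `gzip` module instead of unpacking it first. This is only a sketch and it assumes one JSON object per line. The function name `iter_records` is illustrative, and the demo writes a tiny stand-in file rather than using real archive data.

```python
import gzip
import json
import os
import tempfile

def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a gzip file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Demo on a small stand-in file (made-up content, not real archive data)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.json.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('{"type": "PushEvent"}\n\n{"type": "WatchEvent"}\n')

records = list(iter_records(demo_path))
print([rec['type'] for rec in records])
```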
-1