C#: parsing a Freebase RDF dump yields just 11.5 million N-triples instead of 1.9 billion

I am working on a C# program to read the RDF data in the Google Freebase data dump. To get started, I wrote a simple loop that just reads the file and counts the triples. However, instead of the 1.9 billion triples stated on the documentation page (see above), my program counts only about 11.5 million and then exits. The relevant part of the source code is shown below (it takes about 30 seconds to run).

What am I missing here?

// Simple read-through of the .gz file.
// Requires: using System; using System.IO; using System.IO.Compression;
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";

        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        {
            StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true);

            while (true)
            {
                readLine = sr.ReadLine();
                if (readLine != null)
                {
                    tupleCount++;
                    if (tupleCount % 1000000 == 0)
                    { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
                }
                else
                { break; }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }

(I also tried the GZippedNTriplesParser from dotNetRdf, but it fails with an RdfParseException (tab delimiters? UTF-8?).)


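The Freebase RDF dump is built by a map/reduce job that writes the data out as 200 separate Gzip files. Those 200 Gzip files are then concatenated into one final file. According to the Gzip specification, a concatenation of Gzip files is itself a valid Gzip file, and a conforming library should decompress it to the concatenated contents of all its members.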

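Judging by the count you get, the library you are using probably stops after the first chunk and ignores the remaining 199. I am not a C# expert, but another answer (fooobar.com/questions/2110992/...) suggests using DotNetZip instead.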
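To check whether a given runtime shows the behaviour described above, here is a minimal sketch that builds a tiny two-member Gzip file in memory and counts the lines that System.IO.Compression.GZipStream returns. The class and method names (MultiMemberGzipDemo, BuildTwoMemberFile, CountLines) are made up for this illustration, and the outcome depends on the runtime and library version, so no particular result is promised.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class MultiMemberGzipDemo
{
    // Compress one block of text into a standalone gzip member.
    static byte[] GzipMember(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress, leaveOpen: true))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gz.Write(raw, 0, raw.Length);
            }
            // The gzip trailer is written when gz is disposed, so read the bytes afterwards.
            return buffer.ToArray();
        }
    }

    // Concatenate two members byte for byte, mimicking (on a tiny scale)
    // how the dump is described as being assembled from separate Gzip files.
    public static byte[] BuildTwoMemberFile()
    {
        byte[] first = GzipMember("a\nb\n");
        byte[] second = GzipMember("c\nd\n");
        var all = new byte[first.Length + second.Length];
        Buffer.BlockCopy(first, 0, all, 0, first.Length);
        Buffer.BlockCopy(second, 0, all, first.Length, second.Length);
        return all;
    }

    // Count the lines the decompressor actually hands back.
    public static long CountLines(Stream compressed)
    {
        long count = 0;
        using (var gz = new GZipStream(compressed, CompressionMode.Decompress))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            while (reader.ReadLine() != null) count++;
        }
        return count;
    }

    static void Main()
    {
        long lines = CountLines(new MemoryStream(BuildTwoMemberFile()));
        Console.WriteLine(lines == 4
            ? "all members read (4 lines)"
            : "stopped early: " + lines + " of 4 lines");
    }
}
```

If this reports fewer than 4 lines, the decompressor is stopping at the end of the first member, which would be consistent with getting only a fraction of the expected triple count from the full dump.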


I use DotNetZip with a GzipDecorator class around its GZipStream to work around the "gzipped chunks" problem.

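// Requires DotNetZip: using System; using System.IO; using Ionic.Zlib;
sealed class GzipDecorator : Stream
{
    private readonly Stream _readStream;
    private GZipStream _gzip;
    private long _totalIn;   // compressed bytes consumed by the members read so far
    private long _totalOut;  // decompressed bytes produced by the members read so far

    public GzipDecorator(Stream readStream)
    {
        if (readStream == null) throw new ArgumentNullException("readStream");
        _readStream = readStream;
        _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        var bytesRead = _gzip.Read(buffer, offset, count);
        // The current gzip chunk is exhausted but the file has more data:
        // reposition to the start of the next chunk and continue reading.
        if (bytesRead <= 0 && _readStream.Position < _readStream.Length)
        {
            // 18 is presumably the 10-byte gzip header plus the 8-byte trailer;
            // this assumes the chunks carry no optional header fields.
            _totalIn += _gzip.TotalIn + 18;
            _totalOut += _gzip.TotalOut;
            _gzip.Dispose();
            _readStream.Position = _totalIn;
            _gzip = new GZipStream(_readStream, CompressionMode.Decompress, true);
            bytesRead = _gzip.Read(buffer, offset, count);
        }
        return bytesRead;
    }

    // Remaining abstract Stream members: this wrapper is read-only and forward-only.
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return false; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { throw new NotSupportedException(); } }
    public override long Position
    {
        get { throw new NotSupportedException(); }
        set { throw new NotSupportedException(); }
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) { throw new NotSupportedException(); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }
}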

I managed to solve the problem by repacking the dump using the "7-zip" archiver. Maybe this will help you.
