I am working on a C# program to read the RDF data in the Google Freebase data dump. To get started, I wrote a simple loop that just reads the file and counts the triples. However, instead of the count of 1.9 billion indicated on the documentation page (see above), my program counts only about 11.5 million and then exits. The relevant part of the source code is shown below (it takes about 30 seconds to run).
What am I missing here?
try
{
    using (FileStream fileToDecompress = File.Open(@"C:\Users\Krishna\Downloads\freebase-rdf-2014-02-16-00-00.gz", FileMode.Open))
    {
        int tupleCount = 0;
        string readLine = "";
        using (GZipStream decompressionStream = new GZipStream(fileToDecompress, CompressionMode.Decompress))
        {
            StreamReader sr = new StreamReader(decompressionStream, detectEncodingFromByteOrderMarks: true);
            while (true)
            {
                readLine = sr.ReadLine();
                if (readLine != null)
                {
                    tupleCount++;
                    if (tupleCount % 1000000 == 0)
                    { Console.WriteLine(DateTime.Now.ToShortTimeString() + ": " + tupleCount.ToString()); }
                }
                else
                { break; }
            }
            Console.WriteLine("Tuples: " + tupleCount.ToString());
        }
    }
}
catch (Exception ex)
{ Console.WriteLine(ex.Message); }
(I also tried the GZippedNTriplesParser in dotNetRdf on the same file, but it failed with an RdfParseException (tab delimiters? UTF-8?).)
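One check that might help narrow this down, as a rough sketch rather than a fix: count the lines the same way, then compare how far the underlying file stream advanced with the total file length. If the reader reports end-of-stream while most of the compressed file is still unread, that would suggest the decompressor stopped early rather than the file being short. The names GzipDiagnostic and CountLines are made up for illustration, and the reported position is only approximate because GZipStream reads the underlying stream in buffered chunks.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipDiagnostic
{
    // Hypothetical helper: counts lines in a gzip-compressed stream and reports
    // roughly how many compressed bytes had been read when ReadLine returned null.
    public static long CountLines(Stream compressed, out long compressedBytesRead)
    {
        long lines = 0;
        // leaveOpen: true so the underlying stream's Position can still be read afterwards.
        using (var gz = new GZipStream(compressed, CompressionMode.Decompress, leaveOpen: true))
        using (var reader = new StreamReader(gz, Encoding.UTF8))
        {
            while (reader.ReadLine() != null)
            {
                lines++;
            }
        }
        // Approximate: GZipStream pulls input in chunks, so this may overshoot slightly.
        compressedBytesRead = compressed.Position;
        return lines;
    }

    public static void Main(string[] args)
    {
        // Assumption: the path to the .gz file is passed as the first argument.
        using (FileStream file = File.OpenRead(args[0]))
        {
            long consumed;
            long lines = CountLines(file, out consumed);
            Console.WriteLine("Lines: " + lines);
            Console.WriteLine("Compressed bytes read: " + consumed + " of " + file.Length);
        }
    }
}
```

If the second number comes out far below the file length, the problem is more likely in how the archive is being decompressed than in the line-counting loop itself.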