Wikipedia page-to-page links by page_id

What?:
I'm trying to get a page-to-page link map of Wikipedia, keyed by page_id, in the following format:

from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...

Why?:
I'm looking for a dataset (Wikipedia pages) to try out PageRank on.

Problem:
At dumps.wikimedia.org you can download pages-articles.xml, which is an XML file with this format:

<page>
  <title>...</title>
  <id>...</id>          <!-- page_id -->
  <text>...</text>
</page>

which I will use to retrieve the article text (text). There is also a table dump for pages (page.sql), which contains some information about each page, including its page_id, and finally the one that seems most relevant to me, pagelinks.sql, which contains the page link records. The problem is that the pagelinks table has the following fields: pl_from, pl_namespace and pl_title. In other words, the source of a link is given as a page ID (pl_from), but the target is given only as a namespace and a title, not as a page_id.
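For the XML side, a streaming parser avoids loading the whole dump into memory. The following is only a minimal sketch in Python: the function names `iter_pages` and `local_name` are made up for illustration, and the sample input follows the simplified layout shown above. In a real dump the text element is nested inside a revision element (which has its own id), so the sketch reads id and title only from the direct children of each page and looks for text at any depth.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title, text) for each <page> element in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        page_id = title = None
        # Only direct children: a nested <revision> has its own <id>.
        for child in elem:
            name = local_name(child.tag)
            if name == "id":
                page_id = int(child.text)
            elif name == "title":
                title = child.text
        # <text> may be a direct child (as above) or nested deeper.
        text = next(
            (n.text or "" for n in elem.iter() if local_name(n.tag) == "text"),
            None,
        )
        yield page_id, title, text
        elem.clear()  # release the finished <page> subtree


# Toy input, invented for this example.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><id>1</id><text>links to [[B]]</text></page>"
    b"<page><title>B</title><id>2</id><text></text></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
```

For the real file you would pass an open file object (or a decompressing stream) instead of the `io.BytesIO` sample.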

Idea: create a temporary database, import the page and pagelinks tables, and build this map from pagelinks, resolving each pl_title to its page_id through the page table.

Possible solution:

SELECT pl_from, GROUP_CONCAT(page_id SEPARATOR ' ') FROM pagelinks
    JOIN page ON 
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY pl_from

or, to get a backlink map (to1 from1 from2 from3 ..., instead of from1 to1 to2 to3 ...):

SELECT page_id, GROUP_CONCAT(pl_from SEPARATOR ' ') FROM pagelinks
    JOIN page ON 
        pl_title = page_title AND pl_namespace = page_namespace
GROUP BY page_id
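To check what the join does on a small scale, here is a rough sketch in Python using an in-memory SQLite database. The rows are toy data invented for the example, and SQLite only stands in for MySQL here. Instead of GROUP_CONCAT, it fetches (pl_from, page_id) pairs and groups them in Python. One reason to consider this: as far as I know, MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 bytes by default), which heavily linked pages would exceed unless the setting is raised.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
-- Toy data: A -> B, A -> C, B -> C, and C -> a page that does not exist.
INSERT INTO page VALUES (1, 0, 'A'), (2, 0, 'B'), (3, 0, 'C');
INSERT INTO pagelinks VALUES
    (1, 0, 'B'), (1, 0, 'C'), (2, 0, 'C'), (3, 0, 'Missing');
""")

# Same join as in the queries above, but without grouping in SQL.
pairs = conn.execute("""
    SELECT pl_from, page_id
    FROM pagelinks
    JOIN page ON pl_title = page_title AND pl_namespace = page_namespace
    ORDER BY pl_from, page_id
""").fetchall()

forward = defaultdict(list)   # from -> [to1, to2, ...]
backward = defaultdict(list)  # to   -> [from1, from2, ...]
for pl_from, page_id in pairs:
    forward[pl_from].append(page_id)
    backward[page_id].append(pl_from)


def to_lines(mapping):
    """Render {key: [values]} as 'key v1 v2 ...' lines, sorted by key."""
    return [
        " ".join(str(n) for n in [key, *values])
        for key, values in sorted(mapping.items())
    ]


forward_lines = to_lines(forward)
backward_lines = to_lines(backward)
```

Note that the inner join silently drops links whose target title has no row in page (the 'Missing' link here), so pages with no resolvable outgoing links do not appear in the forward map at all.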

:
, ​​ page_id, ? , , ​​, , ?


, , , ( pages-articles.xml).

sql, . , . Net.


, XML, , :

http://haselgrove.id.au/wikipedia.htm

, .m(MATLAB, OCTAVE), . , pre- .txt. , . 2009 .
