What?:
I'm trying to get the page-to-page link map of Wikipedia, keyed by page_id, in the following format:
from1 to1 to2 to3 ...
from2 to1 to2 to3 ...
...
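To make the target format concrete, here is a minimal sketch of how such a file could be read back into an adjacency list before running PageRank on it. The function name `read_adjacency` and the sample IDs are made up for illustration; the only assumption taken from the format above is that each line holds a source page_id followed by whitespace-separated target page_ids.

```python
from io import StringIO


def read_adjacency(lines):
    """Parse 'from to1 to2 ...' lines into {from_id: [to_id, ...]}."""
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        src, *targets = (int(f) for f in fields)
        # A source with no outgoing links still gets an (empty) entry.
        graph.setdefault(src, []).extend(targets)
    return graph


# Hypothetical three-page sample in the 'from to1 to2 ...' layout.
sample = StringIO("1 2 3\n2 3\n3 1\n")
graph = read_adjacency(sample)
print(graph)
```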
Why?:
I'm looking for a dataset (Wikipedia pages) to try out PageRank.
Problem:
At dumps.wikimedia.org you can download pages-articles.xml, which is an XML file with this format:
<page>
<title>...</title>
<id>...</id> // pageid
<text>...</text>
</page>
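For reference, a file with this layout could be streamed page by page instead of being loaded whole, roughly as sketched below. This assumes the simplified structure shown above; `iter_pages` and `local` are hypothetical helper names, and the tag-name stripping is there only as a precaution in case the real dump puts its elements in an XML namespace or nests some of them more deeply.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Drop a '{namespace}' prefix from a tag name, if there is one."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (page_id, title, text) for every <page> element in source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {"title": None, "id": None, "text": None}
        # Walk the page in document order; the first match for each name wins,
        # so a later nested <id> would not overwrite the page's own <id>.
        for child in elem.iter():
            name = local(child.tag)
            if name in fields and fields[name] is None:
                fields[name] = child.text
        yield int(fields["id"]), fields["title"], fields["text"] or ""
        elem.clear()  # release the memory held by the finished page


# Tiny made-up document in the same shape as the excerpt above.
sample = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>A</title><id>1</id><text>[[B]]</text></page>"
    b"<page><title>B</title><id>2</id><text>[[A]]</text></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
print(pages)
```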
which I will use to retrieve the articles (text). There are also SQL dumps: page.sql, which contains information about each page along with its page_id, and, the one that seems most relevant to me, pagelinks.sql, which contains the page link records. The problem is that the pagelinks table has the fields pl_from, pl_namespace and pl_title, so link targets are identified by title rather than by page_id.
Idea: create a temporary database, import the page and pagelinks tables, and build this matrix from pagelinks, resolving each pl_title to its page_id.
Possible Solution:
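-- Forward map: one row per source page, followed by the ids of the pages it links to.
-- Links whose target title has no row in page are dropped by the inner join.
SELECT pl_from, GROUP_CONCAT(page_id SEPARATOR ' ') FROM pagelinks
JOIN page ON
pl_title = page_title AND pl_namespace = page_namespace
GROUP BY pl_from;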
or, to get a backlink map (to1 from1 from2 from3 ..., rather than from1 to1 to2 to3 ...):
-- Backlink map: one row per target page, followed by the ids of the pages linking to it.
SELECT page_id, GROUP_CONCAT(pl_from SEPARATOR ' ') FROM pagelinks
JOIN page ON
pl_title = page_title AND pl_namespace = page_namespace
GROUP BY page_id;
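The join can be tried on toy data before importing the full dumps. The sketch below is only an illustration: it uses Python's built-in sqlite3 instead of MySQL, the table contents are made up, and the grouping is done in Python rather than with GROUP_CONCAT. One reason to consider grouping on the client side is that MySQL truncates GROUP_CONCAT results at group_concat_max_len (1024 by default, as far as I know), which heavily linked pages would probably exceed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
CREATE TABLE pagelinks (pl_from INTEGER, pl_namespace INTEGER, pl_title TEXT);
-- Made-up rows; 'Missing' has no page row and should drop out of the join.
INSERT INTO page VALUES (1, 0, 'A'), (2, 0, 'B'), (3, 0, 'C');
INSERT INTO pagelinks VALUES
    (1, 0, 'B'), (1, 0, 'C'), (2, 0, 'C'), (3, 0, 'A'), (3, 0, 'Missing');
""")

# Same join condition as the queries above, without the GROUP BY.
rows = conn.execute("""
SELECT pl_from, page_id
FROM pagelinks
JOIN page ON pl_title = page_title AND pl_namespace = page_namespace
ORDER BY pl_from, page_id
""").fetchall()

forward = {}   # from_id -> [to_id, ...]
backward = {}  # to_id -> [from_id, ...]
for src, dst in rows:
    forward.setdefault(src, []).append(dst)
    backward.setdefault(dst, []).append(src)

# Emit the 'from1 to1 to2 ...' format described at the top.
for src, targets in forward.items():
    print(src, *targets)
```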