Page link parser needs to distinguish namespace #97

@daxenberger

Originally reported on Google Code with ID 103

In the page_inlink.txt file, some page links (e.g. "Henry_Hutchinson" -> "Stub") are
wrong.
This is because the page link parser does not distinguish between namespaces: some
pages actually link to "Wikipedia:Stub", but the link is recorded as a link to "Stub".

I suggest modifying the method:
public void processPageLinksRow(PagelinksParser plParser)
in SingleDumpVersionJDK.java
from

public void processPageLinksRow(PagelinksParser plParser)
        throws IOException {
    int pl_from = plParser.getPlFrom();
    String pl_to = plParser.getPlTo();
    if (pl_to != null) {
        KeyType plToHash = (KeyType) hashAlgorithm.hashCode(pl_to);
        Integer pl_toValue = pNamePageIdMap.get(plToHash);
        // skip redirects if skipPage is enabled
        if ((!skipPage || pPageIdNameMap.containsKey(pl_from))
                && pl_toValue != null) {
            pageOutlinks.addRow(pl_from, pl_toValue);
            pageInlinks.addRow(pl_toValue, pl_from);
        }
    }
}

to

public void processPageLinksRow(PagelinksParser plParser)
        throws IOException {
    int pl_from = plParser.getPlFrom();
    String pl_to = plParser.getPlTo();
    int pl_namespace = plParser.getPlNamespace();
    if (pl_to != null) {
        switch (pl_namespace) {
        // NS_MAIN is assumed to be a constant for the main (article) namespace, i.e. 0
        case NS_MAIN: {
            KeyType plToHash = (KeyType) hashAlgorithm.hashCode(pl_to);
            Integer pl_toValue = pNamePageIdMap.get(plToHash);
            // skip redirects if skipPage is enabled
            if ((!skipPage || pPageIdNameMap.containsKey(pl_from))
                    && pl_toValue != null) {
                pageOutlinks.addRow(pl_from, pl_toValue);
                pageInlinks.addRow(pl_toValue, pl_from);
            }
            break;
        }
        default:
            // links into other namespaces (e.g. "Wikipedia:Stub") are ignored
            break;
        }
    }
}
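
To illustrate why the namespace check matters, here is a minimal, self-contained sketch. It is not the project's code: the class NamespaceFilterSketch, the LinkRow type and the buildInlinks helper are hypothetical names made up for this example, and a plain title-to-id map stands in for pNamePageIdMap. It relies only on the MediaWiki convention that the pagelinks table stores the target title without its namespace prefix, with namespace 0 for articles and 4 for "Wikipedia:" pages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamespaceFilterSketch {
    // MediaWiki namespace numbers: 0 is the main (article) namespace,
    // 4 is the project namespace ("Wikipedia:").
    static final int NS_MAIN = 0;
    static final int NS_PROJECT = 4;

    /** Hypothetical stand-in for one row of the pagelinks table. */
    static final class LinkRow {
        final int from;        // id of the linking page
        final int namespace;   // namespace of the link target
        final String title;    // target title WITHOUT namespace prefix

        LinkRow(int from, int namespace, String title) {
            this.from = from;
            this.namespace = namespace;
            this.title = title;
        }
    }

    /**
     * Builds a map from target page id to the ids of pages linking to it.
     * When mainNamespaceOnly is false, the namespace is ignored, which
     * mimics the behaviour described in this report.
     */
    static Map<Integer, List<Integer>> buildInlinks(List<LinkRow> rows,
            Map<String, Integer> mainTitleToId, boolean mainNamespaceOnly) {
        Map<Integer, List<Integer>> inlinks = new HashMap<>();
        for (LinkRow row : rows) {
            if (mainNamespaceOnly && row.namespace != NS_MAIN) {
                continue; // e.g. a link to "Wikipedia:Stub"
            }
            Integer targetId = mainTitleToId.get(row.title);
            if (targetId != null) {
                inlinks.computeIfAbsent(targetId, k -> new ArrayList<>())
                        .add(row.from);
            }
        }
        return inlinks;
    }

    public static void main(String[] args) {
        Map<String, Integer> ids = new HashMap<>();
        ids.put("Stub", 7); // the article "Stub" in the main namespace

        List<LinkRow> rows = new ArrayList<>();
        rows.add(new LinkRow(1, NS_PROJECT, "Stub")); // really "Wikipedia:Stub"
        rows.add(new LinkRow(2, NS_MAIN, "Stub"));    // really "Stub"

        System.out.println(buildInlinks(rows, ids, false)); // {7=[1, 2]}
        System.out.println(buildInlinks(rows, ids, true));  // {7=[2]}
    }
}
```

Without the check, page 1 is wrongly recorded as linking to the article "Stub"; with it, only the genuine main-namespace link from page 2 remains.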

Reported by astronautguo on 2012-09-20 16:24:42
