NAME

RDF::NLP::SPARQLQuery - Perl extension for converting natural language questions into SPARQL queries

SYNOPSIS

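    use RDF::NLP::SPARQLQuery;

    # create the converter and load its configuration
    my $NLQuestion = RDF::NLP::SPARQLQuery->new();
    $NLQuestion->configFile("t/nlquestion.rc");
    $NLQuestion->loadConfig;

    # choose the output format and load the annotated questions
    $NLQuestion->format("SPARQL");
    $NLQuestion->loadInput("examples/example1.qald");

    # convert the questions and print the result
    my $outStr;
    $NLQuestion->Questions2Queries(\$outStr);
    print $outStr;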

DESCRIPTION

This module aims at querying RDF knowledge bases with questions expressed in natural language. Natural language questions are converted into SPARQL queries. The method is based on rules and resources. Resources are provided for querying Drugbank (<http://www.drugbank.ca>), Diseasome (<http://diseasome.eu>) and Sider (<http://sideeffects.embl.de>).

The natural language question must already be annotated with linguistic and semantic information. The input file provides this information (see the section INPUT FORMAT for details on the format).

The object has 6 fields:

  • files is a hashtable containing the names of the three files which are needed to run the converter (the configuration file name in the key config, and the name of the file defining the semantic correspondence and the rewriting rules in the key semtypecorresp).

  • config contains the configuration structure.

  • questions contains the list of natural language questions.

  • semtypecorresp contains the semantic correspondence and the rewriting rules used to generate the SPARQL queries.

  • format contains the format of the output. Accepted values are SPARQL (the SPARQL query), XML (the SPARQL query in the QALD challenge XML format), SPARQLANSWERS (the answers returned by the SPARQL query), XMLANSWERS (the answers returned by the SPARQL query, in the QALD challenge XML format).

  • verbose specifies the verbose level.

METHODS

new()

new();

The method creates and returns a new converter for translating natural language questions into SPARQL queries.

format()

format($formatValue);

The method sets or returns the format of the output (accepted values are XML, SPARQL, XMLANSWERS and SPARQLANSWERS).
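Since only four values are accepted, a caller may want to check a user-supplied value before passing it to format(). The following is a minimal sketch of such a guard; check_format is a hypothetical helper, not part of the module, and the upper-casing of the value is an assumption made for convenience.

```perl
use strict;
use warnings;

# Accepted output formats, as listed in this documentation.
my %ACCEPTED_FORMAT = map { $_ => 1 } qw(SPARQL XML SPARQLANSWERS XMLANSWERS);

# Hypothetical helper (not provided by RDF::NLP::SPARQLQuery):
# upper-cases the value and returns it if it is an accepted format,
# otherwise returns undef.
sub check_format {
    my ($value) = @_;
    return unless defined $value;
    my $name = uc $value;
    return $ACCEPTED_FORMAT{$name} ? $name : undef;
}

for my $candidate (qw(SPARQL xmlanswers JSON)) {
    my $ok = check_format($candidate);
    print defined $ok
        ? "$candidate -> $ok\n"
        : "$candidate is not an accepted format\n";
}
```

The value returned by check_format can then be passed to format().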

verbose()

verbose($verboseLevel);

The method sets or returns the level of the verbose mode (accepted values: 0 to 2).

loadConfig()

loadConfig();

The method loads the configuration from the file named in the config key of the field files (as returned by configFile).

loadInput()

loadInput($questionFile);

The method loads the questions from the file given as argument ($questionFile). The method can be called several times to load several question files.

Questions2Queries()

Questions2Queries(\$outputStr);

The method runs the converter on the questions recorded in the field questions. The results are returned in the variable $outputStr.

config()

config();

The method sets or returns the configuration structure.

configFile()

configFile();

The method sets or returns the name of the configuration file.

semtypecorresp()

semtypecorresp();

The method sets or returns the semantic correspondence and the rewriting rules used to convert the natural language questions into SPARQL queries.

questions()

questions();

The method returns the hashtable containing the natural language questions, or initialises the hashtable. The keys are the identifiers of the questions and the values are RDF::NLP::SPARQLQuery::Question objects.

getQuestionList()

getQuestionList();

The method returns the list of questions (each question is an RDF::NLP::SPARQLQuery::Question object).

questionIds()

questionIds();

The method returns the identifiers of the questions.

getQuestionFromId()

getQuestionFromId($questionId);

The method returns the question corresponding to the identifier $questionId.

INPUT FORMAT

The input file is composed of several parts providing linguistic and semantic information on the natural language question:

  • the identifier of the question is introduced by DOC: on one line. For instance:

    DOC: question1
  • the language of the question is defined with language: on one line. For instance:

    language: EN
  • the list of sentences is introduced by the keyword sentence: and ends with the keyword _END_SENT_ (each on its own line). For instance:

    sentence:
    Which diseases is Cetuximab used for?
    _END_SENT_
  • the morpho-syntactic information associated with each word is introduced by the keyword word information: and ends with the keyword _END_POSTAG_ (each on its own line). Each line contains 4 fields separated by tabs: the inflected form of the word, its part-of-speech tag, its lemma and its offset (in number of characters). For instance:

    word information:
    Which	WDT	which	10	
    diseases	NNS	disease	16	
    is	VBZ	be	25	
    Cetuximab	VBN	Cetuximab	28	
    used	VBN	use	38	
    for	IN	for	43	
    ?	SENT	?	46	
    _END_POSTAG_
  • the semantic entities and their associated semantic information are introduced by the keyword semantic units: and end with the keyword _END_SEM_UNIT_ (each on its own line). Each line contains 5 fields separated by tabs: the semantic entity, its canonical form, its semantic types (separated by colons), its start offset and its end offset (in number of characters). For instance:

    semantic units:
    # term form<tab>term canonical form<tab>semantic features<tab>offset start<tab>offset end (ended by _END_SEM_UNIT_)
    diseases	diseas	disease:disease	16	23
    Cetuximab	Cetuximab	drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002	28	36
    used for	used for	possibleDrug:possibleDrug	38	45
    Cetuximab	Cetuximab	drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002	28	36
    diseases	diseas	disease:disease	16	23
    used for	used for	possibleDrug:possibleDrug	38	45
    _END_SEM_UNIT_

    Semantic types can be decomposed into subtypes. They are coded in the same way as a Unix file path.
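To illustrate how the offsets and the path-like semantic types fit together, here is a small sketch using values copied from the example above. It assumes that the end offset is inclusive (which the example suggests: "used for" spans 38 to 45, and "for" starts at 43); words_in_unit is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Word offsets and one semantic unit, copied from the example above.
my @words = (
    { form => 'Which',     offset => 10 },
    { form => 'diseases',  offset => 16 },
    { form => 'is',        offset => 25 },
    { form => 'Cetuximab', offset => 28 },
    { form => 'used',      offset => 38 },
    { form => 'for',       offset => 43 },
);
my %unit = ( form => 'used for', start => 38, end => 45 );

# Hypothetical helper: returns the words whose offset falls inside
# [start, end] (the end offset is assumed to be inclusive).
sub words_in_unit {
    my ($unit, @candidates) = @_;
    return grep {
        $_->{offset} >= $unit->{start} && $_->{offset} <= $unit->{end}
    } @candidates;
}

my @covered = map { $_->{form} } words_in_unit(\%unit, @words);
print "words covered by '$unit{form}': @covered\n";

# Semantic types are separated by ':' and each type is a '/'-separated
# path, like a Unix file path.
my $types = 'drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002';
my @paths = map { [ split m{/}, $_ ] } split /:/, $types;
print "top-level type: $paths[0][0], most specific subtype: $paths[0][-1]\n";
```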

NB: Comments are introduced by the character #. Empty lines are ignored.
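The format described above can be read with a simple line-oriented loop. The sketch below is only an illustration of the format, not the module's own loader: parse_question_text and the shape of the returned hash are assumptions, and it handles a single question.

```perl
use strict;
use warnings;

# Hypothetical parser (not part of RDF::NLP::SPARQLQuery): reads one
# annotated question and returns a hash reference describing it.
sub parse_question_text {
    my ($text) = @_;
    my %q = ( sentences => [], words => [], semunits => [] );
    my $section = '';    # block currently being read
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;    # empty lines are ignored
        next if $line =~ /^\s*#/;    # comments start with '#'
        if    ($line =~ /^DOC:\s*(\S+)/)      { $q{id}       = $1 }
        elsif ($line =~ /^language:\s*(\S+)/) { $q{language} = $1 }
        elsif ($line =~ /^sentence:/)         { $section = 'sentences' }
        elsif ($line =~ /^word information:/) { $section = 'words' }
        elsif ($line =~ /^semantic units:/)   { $section = 'semunits' }
        elsif ($line =~ /^_END_(?:SENT|POSTAG|SEM_UNIT)_/) { $section = '' }
        elsif ($section eq 'sentences') { push @{ $q{sentences} }, $line }
        elsif ($section eq 'words') {
            # 4 tab-separated fields: form, POS tag, lemma, offset
            my ($form, $pos, $lemma, $offset) = split /\t/, $line;
            push @{ $q{words} },
                { form => $form, pos => $pos, lemma => $lemma, offset => $offset };
        }
        elsif ($section eq 'semunits') {
            # 5 tab-separated fields; semantic types are separated by ':'
            my ($form, $canonical, $types, $start, $end) = split /\t/, $line;
            push @{ $q{semunits} }, {
                form  => $form,  canonical => $canonical,
                types => [ split /:/, $types ],
                start => $start, end => $end,
            };
        }
    }
    return \%q;
}

# A shortened version of the example above, built with explicit tabs.
my $text = join "\n",
    'DOC: question1',
    'language: EN',
    'sentence:',
    'Which diseases is Cetuximab used for?',
    '_END_SENT_',
    'word information:',
    join("\t", qw(Which WDT which 10)),
    join("\t", qw(diseases NNS disease 16)),
    '_END_POSTAG_',
    'semantic units:',
    '# a comment line, skipped by the parser',
    join("\t", 'diseases', 'diseas', 'disease:disease', 16, 23),
    '_END_SEM_UNIT_';

my $q = parse_question_text($text);
print "$q->{id} ($q->{language}): $q->{sentences}[0]\n";
print scalar(@{ $q->{words} }), " words, ",
      scalar(@{ $q->{semunits} }), " semantic unit(s)\n";
```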

Example files are available in the examples directory of the archive.

CONFIGURATION FILE FORMAT

The configuration file format is similar to the Apache configuration format. The module Config::General is used to read the file. It contains one section named NLQUESTION per language (identified by the attribute language). Each section sets the following variables, which control the behaviour of the script:

  • VERBOSE: defines the verbosity level, like the option --verbose. The option overrides this variable when both are given.

  • REGEXFORM: this boolean variable indicates which form is used when a regex is involved: the inflected form (value 1) or the canonical form (value 0).

  • UNION: this boolean variable indicates whether the union is used.

  • SEMANTICTYPECORRESPONDANCE: this variable defines the file containing the semantic information (rewriting rules, semantic correspondence, etc.) used to generate the SPARQL queries.

  • URL_PREFIX: specifies the beginning of the URL (before the SPARQL query) when the query is sent to a Virtuoso server.

  • URL_SUFFIX: specifies the end of the URL (after the SPARQL query) when the query is sent to a Virtuoso server.
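As an illustration of how URL_PREFIX and URL_SUFFIX could frame a query, the sketch below concatenates the prefix, the percent-encoded query and the suffix. This is an assumption about the intended use, not the module's actual code; the URL values, percent_encode and build_query_url are all hypothetical.

```perl
use strict;
use warnings;

# Percent-encodes every character except the RFC 3986 unreserved ones.
# It works character by character, so non-ASCII text should be
# UTF-8 encoded into bytes beforehand.
sub percent_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Hypothetical values: real ones come from the configuration file.
my %config = (
    URL_PREFIX => 'http://example.org/sparql?query=',
    URL_SUFFIX => '&format=text%2Fplain',
);

# Hypothetical helper: prefix + encoded query + suffix.
sub build_query_url {
    my ($config, $sparql) = @_;
    return $config->{URL_PREFIX}
         . percent_encode($sparql)
         . $config->{URL_SUFFIX};
}

my $sparql = 'SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 5';
my $url    = build_query_url(\%config, $sparql);
print "$url\n";
```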

SEE ALSO

QALD challenge web page: <http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task2&q=4>

Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.

AUTHORS

Thierry Hamon, <hamon@limsi.fr>

LICENSE

Copyright (C) 2014 by Thierry Hamon

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.1 or, at your option, any later version of Perl 5 you may have available.