NAME
RDF::NLP::SPARQLQuery - Perl extension for converting Natural Language Questions into SPARQL queries
SYNOPSIS
use RDF::NLP::SPARQLQuery;
my $NLQuestion = RDF::NLP::SPARQLQuery->new();
$NLQuestion->configFile("t/nlquestion.rc");
$NLQuestion->loadConfig;
$NLQuestion->format("SPARQL");
$NLQuestion->loadInput("examples/example1.qald");
my $outStr;
$NLQuestion->Questions2Queries(\$outStr);
print $outStr;
DESCRIPTION
This module aims at querying an RDF knowledge base with questions expressed in natural language. Natural language questions are converted into SPARQL queries. The method is based on rules and resources. Resources are provided for querying Drugbank (<http://www.drugbank.ca>), Diseasome (<http://diseasome.eu>) and Sider (<http://sideeffects.embl.de>).
The natural language question must already be annotated with linguistic and semantic information. The input file provides this information (see the section INPUT FORMAT for details regarding the format).
The object has 6 fields:
files: a hash table containing the names of the three files needed to run the converter (the configuration file name under the key config, and the name of the file defining the semantic correspondence and the rewriting rules under the key semtypecorresp).
config: contains the configuration structure.
questions: contains the list of natural language questions.
semtypecorresp: contains the semantic correspondence and the rewriting rules used to generate the SPARQL queries.
format: contains the format of the output. Accepted values are SPARQL (the SPARQL query), XML (the SPARQL query in the QALD challenge XML format), SPARQLANSWERS (the answers returned by the SPARQL query) and XMLANSWERS (the answers returned by the SPARQL query in the QALD challenge XML format).
verbose: specifies the verbose level.
METHODS
new()
new();
The method creates and returns a new converter for translating natural language questions into SPARQL queries.
format()
format($formatValue);
The method sets or returns the format of the output (accepted values are XML, SPARQL, XMLANSWERS and SPARQLANSWERS).
verbose()
verbose($verboseLevel);
The method sets or returns the level of the verbose mode (accepted values: 0 to 2).
loadConfig()
loadConfig();
The method loads the configuration from the file indicated in the field files/config (and returned by configFile).
loadInput()
loadInput($questionFile);
The method loads the questions from the file given as argument ($questionFile). The method can be called several times to load several question files.
Questions2Queries()
Questions2Queries(\$outputStr);
The method runs the converter on the questions recorded in the field questions. The results are returned in the variable $outputStr.
config()
config();
The method sets or returns the configuration structure.
configFile()
configFile();
The method sets or returns the name of the configuration file.
semtypecorresp()
semtypecorresp();
The method sets or returns the semantic correspondence and the rewriting rules used to convert the natural language questions into SPARQL queries.
questions()
questions();
The method returns the hash table containing the natural language questions, or initialises the hash table. The keys are the identifiers of the questions and the values are RDF::NLP::SPARQLQuery::Question objects.
getQuestionList()
getQuestionList();
The method returns the list of questions (each question is an RDF::NLP::SPARQLQuery::Question object).
questionIds()
questionIds();
The method returns the identifiers of the questions.
getQuestionFromId()
getQuestionFromId($questionId);
The method returns the question corresponding to the identifier $questionId.
INPUT FORMAT
The input file is composed of several parts providing linguistic and semantic information on the natural language question:
the identifier of the question is introduced by DOC: on one line. For instance:
    DOC: question1
the language of the question is defined with language: on one line. For instance:
    language: EN
the list of sentence(s) is introduced by the keyword sentence: and ends with the keyword _END_SENT_ (each on a line of its own). For instance:
    sentence:
    Which diseases is Cetuximab used for?
    _END_SENT_
the morpho-syntactic information associated with each word is introduced by the keyword word information: and ends with the keyword _END_POSTAG_ (each on a line of its own). Each line contains 4 fields separated by tabs: the inflected form of the word, its part-of-speech tag, its lemma and its offset (in number of characters). For instance:
    word information:
    Which	WDT	which	10
    diseases	NNS	disease	16
    is	VBZ	be	25
    Cetuximab	VBN	Cetuximab	28
    used	VBN	use	38
    for	IN	for	43
    ?	SENT	?	46
    _END_POSTAG_
the semantic entities and their associated semantic information are introduced by the keyword semantic units: and end with the keyword _END_SEM_UNIT_ (each on a line of its own). Each line contains 5 fields separated by tabs: the semantic entity, its canonical form, its semantic types (separated by colons), its start offset and its end offset (in number of characters). For instance:
    semantic units:
    # term form<tab>term canonical form<tab>semantic features<tab>offset start<tab>offset end (ended by _END_SEM_UNIT_)
    diseases	diseas	disease:disease	16	23
    Cetuximab	Cetuximab	drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002	28	36
    used for	used for	possibleDrug:possibleDrug	38	45
    Cetuximab	Cetuximab	drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002	28	36
    diseases	diseas	disease:disease	16	23
    used for	used for	possibleDrug:possibleDrug	38	45
    _END_SEM_UNIT_
Semantic types can be decomposed into subtypes. They are coded in the same way as a Unix file path.
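As an illustration of this encoding, the following minimal sketch splits one line of the semantic units block into its fields, then splits the semantic types on colons and each type on slashes. It is plain Perl and does not use the module; the variable names are hypothetical and the sample line is taken from the example above.

```perl
use strict;
use warnings;

# One line of the "semantic units:" block, with tab-separated fields.
my $line = join "\t", 'Cetuximab', 'Cetuximab',
    'drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002', 28, 36;

my ($form, $canonical, $semtypes, $start, $end) = split /\t/, $line;

# Semantic types are separated by colons; each type is a path-like
# string whose components (subtypes) are separated by slashes.
my @types = map { [ split m{/}, $_ ] } split /:/, $semtypes;

for my $type (@types) {
    print join(' > ', @$type), "\n";
}
```

Each element of @types is an array reference listing the type from the most general component to the most specific one.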
NB: Comments are introduced by the character #. Empty lines are ignored.
Examples of files are available in the examples directory of the archive.
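To make the layout concrete, here is a minimal sketch of a reader for this format. It is not the module's own parser: it is plain Perl, the function name parse_question and the keys of the returned hash are hypothetical, and it assumes that each block keyword and each end keyword stands alone on its line with no trailing whitespace.

```perl
use strict;
use warnings;

# Hypothetical sample following the layout described above.
my $input = join "\n",
    'DOC: question1',
    'language: EN',
    'sentence:',
    'Which diseases is Cetuximab used for?',
    '_END_SENT_',
    'word information:',
    "Which\tWDT\twhich\t10",
    "diseases\tNNS\tdisease\t16",
    '_END_POSTAG_',
    'semantic units:',
    '# a comment line',
    "diseases\tdiseas\tdisease:disease\t16\t23",
    '_END_SEM_UNIT_',
    '';

# Opening keyword => [closing keyword, key under which lines are stored].
my %blocks = (
    'sentence:'         => [ '_END_SENT_',     'sentences' ],
    'word information:' => [ '_END_POSTAG_',   'words' ],
    'semantic units:'   => [ '_END_SEM_UNIT_', 'semunits' ],
);

sub parse_question {
    my ($text) = @_;
    my %q = (sentences => [], words => [], semunits => []);
    my ($end, $key);    # defined only while inside a block

    for my $line (split /\n/, $text) {
        # Skip empty lines and comments.
        next if $line =~ /^\s*$/ or $line =~ /^#/;
        if (defined $end) {
            if ($line eq $end) { undef $end; undef $key; next; }
            # Sentences are kept as text; other blocks hold tab-separated records.
            push @{ $q{$key} },
                $key eq 'sentences' ? $line : [ split /\t/, $line ];
        }
        elsif ($line =~ /^DOC:\s*(.+)$/)      { $q{id}       = $1 }
        elsif ($line =~ /^language:\s*(.+)$/) { $q{language} = $1 }
        elsif (exists $blocks{$line})         { ($end, $key) = @{ $blocks{$line} } }
    }
    return \%q;
}

my $q = parse_question($input);
printf "%s (%s): %d word(s), %d semantic unit(s)\n",
    $q->{id}, $q->{language},
    scalar @{ $q->{words} }, scalar @{ $q->{semunits} };
```

The sketch handles a single question; a file containing several questions would need a new record to be started at each DOC: line.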
CONFIGURATION FILE FORMAT
The configuration file format is similar to the Apache configuration format. The module Config::General is used to read the file. There is one section named NLQUESTION for each language (identified with the attribute language). Each section defines the following variables, which control the behaviour of the script:
VERBOSE: defines the verbose mode level, similarly to the option --verbose. It is overwritten by this option.
REGEXFORM: this boolean variable indicates whether, when regular expressions are used, the inflected form (value 1) or the canonical form (value 0) is used.
UNION: this boolean variable indicates whether the union is used or not.
SEMANTICTYPECORRESPONDANCE: this variable defines the file containing the semantic information (rewriting rules, semantic correspondence, etc.) used to generate the SPARQL queries.
URL_PREFIX: specifies the beginning of the URL (before the SPARQL query) when the query is sent to a Virtuoso server.
URL_SUFFIX: specifies the end of the URL (after the SPARQL query) when the query is sent to a Virtuoso server.
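A section for English might look like the sketch below. This is an illustration only: the way the language attribute is written on the section line is an assumption, and every value (file path, server URL, URL parameters) is hypothetical. The file t/nlquestion.rc used in the SYNOPSIS is the reference to check against.

```apache
# Hypothetical values throughout; compare with t/nlquestion.rc
<NLQUESTION language="EN">
    VERBOSE = 0
    REGEXFORM = 1
    UNION = 1
    SEMANTICTYPECORRESPONDANCE = path/to/semtypecorresp.rc
    URL_PREFIX = "http://localhost:8890/sparql?query="
    URL_SUFFIX = "&format=json"
</NLQUESTION>
```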
SEE ALSO
QALD challenge web page: <http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task2&q=4>
Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.
AUTHORS
Thierry Hamon, <hamon@limsi.fr>
LICENSE
Copyright (C) 2014 by Thierry Hamon
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.1 or, at your option, any later version of Perl 5 you may have available.