NAME
nlquestion2sparqlquery - Perl script for converting natural language questions into SPARQL queries
SYNOPSIS
nlquestion2sparqlquery [option] --input <FILENAME>
OPTIONS AND ARGUMENTS
--input=filename, -i filename
This option defines the input file to load. If the filename is - (or the option is not specified), the input data is read from STDIN.
--output <filename>, -o <filename>
This option defines the output file to write. If the filename is - (or the option is not specified), the output data is printed on STDOUT.
--rcfile=file, -c file
Load the given configuration file.
--answer, -a
If this option is set, the answers are returned; otherwise, the SPARQL query is returned.
--format [XML|SPARQL], -f [XML|SPARQL]
This option defines the format of the output:
XML: the output is in XML, as required by the QALD challenge
SPARQL: the output is the SPARQL query or the list of answers
--help
Print the help message for nlquestion2sparqlquery.
--man
Print the man page of nlquestion2sparqlquery.
--verbose, -v
Switch to verbose mode. The verbosity can be increased by repeating the option.
--debug, -D
Switch to debug mode for the script nlquestion2sparqlquery (the switch has no influence on the object code).
DESCRIPTION
This script queries RDF knowledge bases with questions expressed in natural language. Natural language questions are converted into SPARQL queries. The method is based on rules and resources. Resources are provided for querying Drugbank (<http://www.drugbank.ca>), Diseasome (<http://diseasome.eu>) and Sider (<http://sideeffects.embl.de>).
The natural language question must already be annotated with linguistic and semantic information. The input file provides this information (see the section INPUT FORMAT for details on the format).
If you use this software, please cite:
Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.
EXAMPLES OF USE
To run the script, a configuration file is needed (usually nlquestion.rc in /etc/nlquestion; see the section CONFIGURATION FILE FORMAT for more details). An example of the configuration file is available in etc/nlquestion/nlquestion.rc in the archive directory.
The most common command line to run nlquestion2sparqlquery is
nlquestion2sparqlquery -i example1.qald
It is assumed that the directory containing the program nlquestion2sparqlquery is in your PATH variable and that the configuration file is /etc/nlquestion/nlquestion.rc. The SPARQL query is printed on STDOUT in QALD XML format.
If you are not allowed to copy the configuration file nlquestion.rc into the directory /etc/nlquestion (or to create this directory), or if you want to use your own configuration file, you can specify the file with its path by using the option --rcfile:
nlquestion2sparqlquery --rcfile nlquestion2.rc -i example1.qald
You can also change the format and record the results in a file:
nlquestion2sparqlquery --rcfile nlquestion2.rc -i example1.qald -f SPARQL -a -o example1.out
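When many question files have to be processed, the command lines above can be assembled programmatically. The following is a minimal sketch in Python, not part of the distribution: the helper name build_command and its parameters are hypothetical, and only the options shown in the examples above are used.

```python
import subprocess


def build_command(input_file, rcfile=None, output_format=None,
                  answers=False, output_file=None):
    """Build the argument list for one nlquestion2sparqlquery run (hypothetical helper)."""
    cmd = ["nlquestion2sparqlquery"]
    if rcfile:
        cmd += ["--rcfile", rcfile]      # use a specific configuration file
    cmd += ["-i", input_file]            # annotated question file
    if output_format:
        cmd += ["-f", output_format]     # XML or SPARQL
    if answers:
        cmd.append("-a")                 # return answers instead of the query
    if output_file:
        cmd += ["-o", output_file]       # write the result to a file
    return cmd


cmd = build_command("example1.qald", rcfile="nlquestion2.rc",
                    output_format="SPARQL", answers=True,
                    output_file="example1.out")
print(" ".join(cmd))
# To actually run it (requires nlquestion2sparqlquery in your PATH):
# subprocess.run(cmd, check=True)
```

The printed command is the same as the last example above; the call to subprocess.run is left commented out because it needs the script to be installed.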
INPUT FORMAT
The input file is composed of several parts providing linguistic and semantic information on the natural language question:
The identifier of the question is introduced by DOC: on one line. For instance:
    DOC: question1
The end of the information associated with the document is marked by the keyword _END_DOC_.
The language of the question is defined with language: on one line. For instance:
    language: EN
The list of sentences is introduced by the keyword sentence: and ends with the keyword _END_SENT_ (each keyword on its own line). For instance:
    sentence:
    Which diseases is Cetuximab used for?
    _END_SENT_
The morpho-syntactic information associated with each word is introduced by the keyword word information: and ends with the keyword _END_POSTAG_ (each keyword on its own line). Each line in between contains four fields separated by tabulations: the inflected form of the word, its part-of-speech tag, its lemma and its offset (in number of characters). For instance:
    word information:
    Which    WDT    which    10
    diseases    NNS    disease    16
    is    VBZ    be    25
    Cetuximab    VBN    Cetuximab    28
    used    VBN    use    38
    for    IN    for    43
    ?    SENT    ?    46
    _END_POSTAG_
The semantic entities and their associated semantic information are introduced by the keyword semantic units: and end with the keyword _END_SEM_UNIT_ (each keyword on its own line). Each line in between contains five fields separated by tabulations: the semantic entity, its canonical form, its semantic types (separated by colons), its start offset and its end offset (in number of characters). For instance:
    semantic units:
    # term form<tab>term canonical form<tab>semantic features<tab>offset start<tab>offset end (ended by _END_SEM_UNIT_)
    diseases    diseas    disease:disease    16    23
    Cetuximab    Cetuximab    drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002    28    36
    used for    used for    possibleDrug:possibleDrug    38    45
    Cetuximab    Cetuximab    drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002    28    36
    diseases    diseas    disease:disease    16    23
    used for    used for    possibleDrug:possibleDrug    38    45
    _END_SEM_UNIT_
Semantic types can be decomposed into subtypes. They are encoded in the same way as a Unix file path (for instance drug/drugbank/gen/DB00002).
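The path-like encoding can be unpacked as follows. This is a minimal sketch in Python, assuming that several semantic types in one field are separated by colons (as in the semantic units example above) and that subtypes are separated by slashes; the function name split_semantic_types is hypothetical and is not part of the script.

```python
def split_semantic_types(field):
    """Split a semantic-features field into types, each given as a list of subtypes."""
    # Assumption: ":" separates types, "/" separates the levels of one type.
    return [semantic_type.split("/") for semantic_type in field.split(":") if semantic_type]


types = split_semantic_types("drug/drugbank/gen/DB00002:drug/drugbank/gen/DB00002")
print(types[0])  # ['drug', 'drugbank', 'gen', 'DB00002']
```

Keeping each type as a list of levels makes it easy to test for a prefix such as drug/drugbank without string matching.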
NB: Comments are introduced by the character #. Empty lines are ignored.
Examples of files are available in the example directory of the archive.
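To make the layout concrete, here is a minimal reader for this format, sketched in Python. It is not the parser used by nlquestion2sparqlquery. It assumes that each block keyword and each end keyword stands on its own line, that fields are separated by tabulations, and that comment lines start with #. The names parse_question_file, BLOCKS and SAMPLE, as well as the keys of the returned records, are hypothetical.

```python
import io

# Block keyword -> (key in the returned record, end keyword).
BLOCKS = {
    "sentence:": ("sentences", "_END_SENT_"),
    "word information:": ("words", "_END_POSTAG_"),
    "semantic units:": ("semantic_units", "_END_SEM_UNIT_"),
}


def parse_question_file(stream):
    """Read annotated questions from a text stream and return a list of records."""
    docs, doc, key, end = [], None, None, None
    for raw in stream:
        line = raw.rstrip("\r\n")
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty lines and comments are ignored
        if key is not None:  # inside a sentence / word / semantic unit block
            if stripped == end:
                key = None
            elif key == "sentences":
                doc[key].append(stripped)
            else:
                doc[key].append(line.split("\t"))
            continue
        if stripped.startswith("DOC:"):
            doc = {"id": stripped[len("DOC:"):].strip(), "language": None,
                   "sentences": [], "words": [], "semantic_units": []}
        elif stripped == "_END_DOC_":
            docs.append(doc)
            doc = None
        elif stripped.startswith("language:"):
            doc["language"] = stripped[len("language:"):].strip()
        elif stripped in BLOCKS:
            key, end = BLOCKS[stripped]
    if doc is not None:  # tolerate a missing final _END_DOC_
        docs.append(doc)
    return docs


SAMPLE = (
    "DOC: question1\n"
    "language: EN\n"
    "sentence:\n"
    "Which diseases is Cetuximab used for?\n"
    "_END_SENT_\n"
    "word information:\n"
    "Which\tWDT\twhich\t10\n"
    "diseases\tNNS\tdisease\t16\n"
    "_END_POSTAG_\n"
    "semantic units:\n"
    "# a comment line\n"
    "diseases\tdiseas\tdisease:disease\t16\t23\n"
    "_END_SEM_UNIT_\n"
    "_END_DOC_\n"
)

docs = parse_question_file(io.StringIO(SAMPLE))
print(docs[0]["id"], docs[0]["language"], len(docs[0]["words"]))
```

The sketch keeps offsets as strings and performs no validation; a well-formed file is assumed.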
CONFIGURATION FILE FORMAT
The configuration file format is similar to the Apache configuration format. The module Config::General is used to read the file. There is one section named NLQUESTION for each language (identified with the attribute language). Each section defines the following variables, which control the behaviour of the script:
VERBOSE: it defines the verbosity level, similarly to the option --verbose. It is overridden by this option.
REGEXFORM: this boolean variable indicates whether, when a regex is used, the inflected form (value 1) or the canonical form (value 0) is used.
UNION: this boolean variable indicates whether the union is used or not.
SEMANTICTYPECORRESPONDANCE: this variable defines the file containing the semantic information (rewriting rules, semantic correspondence, etc.) used to generate the SPARQL queries.
URL_PREFIX: it specifies the beginning of the URL (before the SPARQL query) when the query is sent to a Virtuoso server.
URL_SUFFIX: it specifies the end of the URL (after the SPARQL query) when the query is sent to a Virtuoso server.
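The role of URL_PREFIX and URL_SUFFIX can be illustrated with a short Python sketch. It assumes that the request URL is simply the prefix, the percent-encoded SPARQL query and the suffix concatenated in that order; whether the script encodes the query in exactly this way is not stated here. The function name build_query_url and the endpoint values are hypothetical.

```python
from urllib.parse import quote


def build_query_url(url_prefix, sparql_query, url_suffix):
    """Concatenate prefix, percent-encoded query and suffix (assumed behaviour)."""
    return url_prefix + quote(sparql_query, safe="") + url_suffix


# Hypothetical values for a local Virtuoso endpoint.
url = build_query_url(
    "http://localhost:8890/sparql?query=",
    "SELECT ?s WHERE { ?s ?p ?o }",
    "&format=text%2Fhtml",
)
print(url)
```

With this reading, everything up to the query parameter belongs in URL_PREFIX, and any remaining parameters, such as the result format, belong in URL_SUFFIX.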
SEE ALSO
QALD challenge web page: <http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/index.php?x=task2&q=4>
Natural Language Question Analysis for Querying Biomedical Linked Data. Thierry Hamon, Natalia Grabar, and Fleur Mougin. Natural Language Interfaces for Web of Data (NLIWod 2014). 2014. To appear.
AUTHOR
Thierry Hamon, <hamon@limsi.fr>
COPYRIGHT AND LICENSE
Copyright (C) 2014 Thierry Hamon
This is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.14.2 or, at your option, any later version of Perl 5 you may have available.