This document describes FreebaseTools 1.2.0 or later.

Overview
========

FreebaseTools is a small toolkit to pre-process, filter, index and store Google's
Freebase knowledge base in a fast and relatively "small" Lucene index.  KB variants
such as BaseKB Gold, which is used as the reference KB for TAC-KBP, can also be handled.

Main Features:

- significant size reduction: reduces the official Google RDF dump from
  about 3B triples to around 880M (or from 400GB uncompressed to about 60GB)
- relatively small deployment size: the English-only view of TAC-KBP's BaseKB
  Gold can be stored in an 11.5 GB index directory which can be efficiently
  searched by a standard desktop machine without huge amounts of memory
- powerful ranked Lucene free-text and fuzzy search on Freebase's description/
  name/label/etc. text fields, combined with hard restrictions on fields such as types
- simple command-line API to explore the data and extract relevant views
  (e.g., all subjects of type person)
- simple Python interface for easy integration and interactive use

Caveats:

- this is not a replacement for a triple store: while many interesting and
  useful queries can be expressed and executed via Lucene, it is not a database
  system and some more complex SPARQL queries can only be emulated via code
  (which might not be very efficient).  There is also no inference.
- the toolkit is an early prototype and does not yet have a fully mature set
  of features and APIs


Pre-built indexes:

As of version 1.2.0, a couple of pre-built indexes are available for download so
you can get started right away without having to build your own. See below on
how to download and install them.


Custom index generation proceeds in two phases:

(1) Pre-processing and shrinking: this phase abbreviates and normalizes URIs
so that they can be easily referenced in Lucene queries, and ignores unwanted
triples (customized by a number of "ignore-*" files).  In particular:
- common namespace prefixes such as http://rdf.freebase.com/ns/ and
  http://rdf.basekb.com/ns/ become short prefixes such as `f_' or `rs_', etc.
- redundant triples (e.g., <f_type.object.type> and <f_type.type.instance>)
  are eliminated (those are already eliminated from BaseKB)
- triples referencing ignored language strings are eliminated
- triples referencing ignored language Wikipedia pages are eliminated
- a number of other useless (for your task) predicates such as ISBN numbers,
  MusicBrainz track listings, various keys, etc. are eliminated

(2) The normalized, shrunk and sorted triples file gets indexed via Lucene.
Each subject and all of its predicates become a Lucene document.  Each
predicate and its values become stored fields in the document.  A small set of
those fields (customizable) are also indexed to allow efficient querying for
names and variants, descriptions, type fields, etc.  Lucene efficiently
compresses all field data, which results in a relatively small index size.
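
The abbreviation and language-filtering steps of phase (1) can be illustrated
with a minimal sketch in standalone Python 3.  This is not the toolkit's
implementation (which lives in the fbt-*.sh scripts); the prefix table, the
rs_ mapping to the RDF Schema namespace, and the rule that language variants
are kept together with their root language are all assumptions made for
illustration.

```python
import re

# Hypothetical prefix table for illustration only; the real mappings are
# hardcoded in the fbt-abbrev-*.sh scripts and may differ.
PREFIXES = [
    ("http://rdf.freebase.com/ns/", "f_"),
    ("http://rdf.basekb.com/ns/", "f_"),
    ("http://www.w3.org/2000/01/rdf-schema#", "rs_"),
]

_URI = re.compile(r"<([^<>\s]+)>")
_LANG = re.compile(r'"@([A-Za-z]+(?:-[A-Za-z0-9]+)*)\s*\.\s*$')


def abbreviate(line):
    """Replace <full-URI> terms with short-prefixed names where a prefix matches.

    URIs without a matching prefix are left untouched.  Note that this naive
    version would also rewrite a <...> sequence inside a literal string.
    """
    def repl(match):
        uri = match.group(1)
        for namespace, short in PREFIXES:
            if uri.startswith(namespace):
                return short + uri[len(namespace):]
        return match.group(0)
    return _URI.sub(repl, line)


def keep_language(line, keep=("en",)):
    """Return False for triples whose object literal has an unwanted language.

    Triples without a language-tagged literal are always kept.  Variants such
    as en-GB are judged by their root language (an assumption of this sketch).
    """
    match = _LANG.search(line)
    if match is None:
        return True
    root = match.group(1).split("-")[0].lower()
    return root in keep
```

Applied to a triples file line by line, one would keep abbreviate(line) for
every line where keep_language(line) is true and drop the rest.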


System Requirements
===================

Java 1.7 and the following libraries which have also been provided
already in the lib directory:

- lucene-5.2.1/core/lucene-core-5.2.1.jar
- lucene-5.2.1/analysis/common/lucene-analyzers-common-5.2.1.jar
- lucene-5.2.1/queryparser/lucene-queryparser-5.2.1.jar
- args4j/2.0.23/args4j-2.0.23.jar
- openrdf-sesame-2.8.4-onejar.jar

Different versions might work as well, but your mileage may vary.  All
code and scripts have been developed and tested under Linux
only (specifically openSUSE).  They should generally work on MacOS as
well but will definitely require adaptation for Windows.

If you want to use the Python interface, you will need Python 2.7 and
a compatible version of the jnius package.


Installation
============

- Unpack the tar file somewhere where you have a good amount of disk space
  available (50GB or more are recommended); let's call this $FBT_HOME
- If not already provided, copy the required Java libraries to the $FBT_HOME/lib directory


Using Pre-built Indexes
=======================

To get started quickly and to let you experiment right away, you can download
the indexes used in the examples below from this location:

  https://drive.google.com/drive/folders/0B5Cp0viUdlRxUnl3M0V2UWV6VzQ

To install them, simply copy the downloaded archive(s) to the $FBT_HOME/data
directory and unpack them there.  IMPORTANT: to avoid losing any index
directories you built yourself, rename them before unpacking.  Unpacking should
create one or more of the following directories depending on which indexes you
downloaded:

  $FBT_HOME/data/basekb-gold-jan-2015.shrink.sort.index
  $FBT_HOME/data/basekb-gold-jan-2015.trilingual.shrink.sort.ml.index

Then make sure the LUCENE_INDEX variable in your config.dat file points to the
proper index directory (the distribution defaults already use the proper names).
At this point you should be able to run queries.  Test the index with something
like this:

  % ./fbt-lookup.sh -q f_m.0h54qv8 -v | head
  Loading index...
  Run time: setup=638ms, query=40ms, display=0ms
  f_m.0h54qv8:
      f_common.topic.article: f_m.0h54qvd
      f_common.topic.description: "Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins. He was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists' Club from 1861 to 1881. He especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum's collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en
      f_common.topic.description: "Rev. Henry Hugh Higgins fue un botánico, briólogo, clérigo, geólogo, y curador inglés. Fue inspector de Escuelas Nacionales, de Liverpool, de 1842 a 1848. Desde 1853 a 1886, fue capellán del Asilo Rainhill, también de Liverpool. Trabajó especialmente en las colecciones Ravenhead, compuestas sobre todo de flora del Carbonífero superior Langsettiano, peces y bivalvos y restos de insectos. El colector fue Liverpool Museum voluntarios reverendo Henry Higgins Hugh y la reunión se hizo desde un sitio de recolecta fue con la construcción del ferrocarril en 1870, donde se exponen dos vetas carboníferas conocidas como el Alto y el Bajo Ravenhead. La mayor parte de las colecciones, sobrevivió al bombardeo de mayo 1941 que prácticamente destruyó al museo de Liverpool; mas por desgracia, todo el material de Ravenhead se perdió en el incendio."@es
      f_common.topic.notable_for: f_g.125crzjzl
      f_common.topic.notable_types: f_m.022tfrk
      f_common.topic.topic_equivalent_webpage: http://es.wikipedia.org/wiki/Henry_Hugh_Higgins
      f_common.topic.topic_equivalent_webpage: http://es.wikipedia.org/wiki/index.html?curid=4127666
      f_common.topic.topic_equivalent_webpage: we_Henry_Higgins_(botanist)
      f_common.topic.topic_equivalent_webpage: we_index.html?curid=32997517


Building Custom Indexes
=======================

If the pre-built indexes don't work for your needs, you can build your own using
the following process.


Shrinking the Triples File(s)

In the instructions below, we give examples for both the TAC-KBP BaseKB Gold
version as well as the full Google RDF dump.  Simply skip the parts for the KB
version you are not using.

The first time you run this, you should probably run everything in the
default configuration which eliminates all non-English language
information.  Once you have successfully created an index and find it
useful, you can change the configuration for your own needs (e.g., use
the tri-lingual setup, preserve other languages of interest, ignore
more or fewer predicates, etc.).  If you feel adventurous, you can
modify the config.dat file right away to suit your needs (see also the
customization section below).

To shrink the BaseKB Gold version that is used as the TAC-KBP reference
knowledge base, run the following commands (substitute the proper LDC KB data
directory):

  % cd $FBT_HOME
  % ./fbt-shrink-freebase.sh -o data/basekb-gold-jan-2015.shrink.sort.test.gz /data/LDC2015E42/data/*.nt.gz

This will do a small test batch of 1M triples.  You should inspect
data/basekb-gold-jan-2015.shrink.sort.test.gz to make sure it looks ok
(properly substituted namespaces, filtered triples, ignored languages, etc.)
Once you are confident of that, you can run the script on the full data by
giving the -f flag:

  % ./fbt-shrink-freebase.sh -f -o data/basekb-gold-jan-2015.shrink.sort.gz /data/LDC2015E42/data/*.nt.gz

This should take approximately 3.5 hours and produce an approximately 7 GB result file.


If you want to work with the full Google RDF dump, run these commands
(substitute the proper input data file):

  % cd $FBT_HOME
  % ./fbt-shrink-freebase.sh -o data/freebase-rdf-latest.shrink.sort.test.gz data/freebase-rdf-latest.gz

This will do a small test batch of 1M triples.  You should inspect
data/freebase-rdf-latest.shrink.sort.test.gz to make sure it looks ok
(properly substituted namespaces, filtered triples, etc.)  Once you are
confident of that, you can run the script on the full data by giving the -f flag:

  % ./fbt-shrink-freebase.sh -f -o data/freebase-rdf-latest.shrink.sort.gz data/freebase-rdf-latest.gz

This should take approximately 6 hours to run and produce an approximately 8GB result file.


Building the Lucene Index

Once you have an appropriately pre-processed and shrunk triples file, you can
build the Lucene index for it.  For the TAC-KBP BaseKB using the file names we
used above, this would look like this:

  % cd $FBT_HOME
  % java -cp '.:bin:lib/*' edu.isi.kres.FreebaseTools -T data/basekb-gold-jan-2015.shrink.sort.gz -I data/basekb-gold-jan-2015.shrink.sort.index -c index -f -o -nn -v

This should take approximately 50 minutes and build an 11GB index directory.
After that has finished, you can test it with the following command, which
should display a single record about Henry Higgins:

  % java -cp '.:bin:lib/*' edu.isi.kres.FreebaseTools -I data/basekb-gold-jan-2015.shrink.sort.index -c lookup -q f_m.0h54qv8 -v

You can also use the following script, which is configured to use
data/basekb-gold-jan-2015.shrink.sort.index as the index directory (if
you used a different name, you have to edit the LUCENE_INDEX variable
in the config.dat file accordingly):

  % ./fbt-lookup.sh -q f_m.0h54qv8 -v


Building the index for a pre-processed full Google RDF dump looks very
similar, just with different file and directory names:

  % java -cp '.:bin:lib/*' edu.isi.kres.FreebaseTools -T data/freebase-rdf-latest.shrink.sort.gz -I data/freebase-rdf-latest.shrink.sort.index -c index -f -o -nn -v

This will take approximately 70 minutes and build a 13GB index directory.  To
test it, run the following query:

  % java -cp '.:bin:lib/*' edu.isi.kres.FreebaseTools -I data/freebase-rdf-latest.shrink.sort.index -c lookup -q f_m.0h54qv8 -v

To use the `fbt-lookup.sh' and `fbt-search.sh' scripts with this
index, you have to edit LUCENE_INDEX in config.dat to point to the
appropriate index directory.


Once the created index(es) are functioning appropriately, you can
optionally delete the pre-processed and shrunk triples files they are
based on.  However, it might be useful to keep them for index
recreation later (e.g., with a different set of indexed predicates),
since they do take significant time to regenerate.


Encoding and Normalization

The N-Triples files use UTF-8 character encoding; however, they also encode
certain Unicode characters as well as newlines, etc. as \uNNNN, \n, \r, \",
etc. escape sequences.  During indexing, escaped character sequences are
translated back into UTF-8 before they get analyzed and stored by Lucene.

Newlines can be normalized to spaces by supplying the -nn command line option.
This is useful to avoid line breaks when querying for predicates.  Newline
normalization can be done at indexing time, which is more efficient, but in
that case the newlines cannot be recovered later.  Alternatively, it can be
done at query time only, by supplying the -nn option to the query command (for
the small cost of some extra run-time computation).
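
To make the two transformations above concrete, here is a minimal sketch in
standalone Python 3 of unescaping an N-Triples literal body and optionally
normalizing newlines to spaces.  It is not the toolkit's Java code; the
function name and the exact set of escapes handled are assumptions of this
sketch.

```python
import re

# Single-character escapes and their replacements.
_SIMPLE = {
    "t": "\t", "n": "\n", "r": "\r", "b": "\b", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}

# Matches \uNNNN, \UNNNNNNNN and the single-character escapes above.
_ESCAPE = re.compile(r'\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}|[tnrbf"\'\\])')


def unescape(value, normalize_newlines=False):
    """Translate escape sequences in a literal body back into characters.

    With normalize_newlines=True, runs of CR/LF characters are replaced by a
    single space, which mimics the effect described for the -nn option.
    """
    def repl(match):
        esc = match.group(1)
        if esc[0] in "uU":
            # Numeric escape: hex digits follow the u/U marker.
            return chr(int(esc[1:], 16))
        return _SIMPLE[esc]

    text = _ESCAPE.sub(repl, value)
    if normalize_newlines:
        text = re.sub(r"[\r\n]+", " ", text)
    return text
```

Once newlines have been replaced during normalization the original line
structure is gone, which is why normalizing at indexing time cannot be undone
later.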


Customization

The following files are used to control the pre-processing and shrinking phase
for an English-only version of the KB:

 - ignore-langs.lst
 - ignore-preds.lst
 - ignore-values.lst

The following files are used for a tri-lingual English/Spanish/Chinese
version:

 - ignore-langs-trilingual.lst
 - ignore-preds-trilingual.lst
 - ignore-values-trilingual.lst

Edit the config.dat file to use the appropriate ignore files.  If you need
your own versions, make your own copies and edit them to ignore more or less.
Be very careful, since these are pattern files that contain TAB characters.
Make sure the TABs are preserved in your edited versions.

Namespace abbreviation is currently hardcoded into the
fbt-abbrev-basekb-uris.sh and fbt-abbrev-freebase-uris.sh scripts.  Edit them
if you need additional or different abbreviations.  If you change prefixes,
various other files such as ignore-preds.lst, indexed-preds.lst, etc. will
need to be adjusted also.  After you are done customizing these files, rerun
the fbt-shrink-freebase.sh script.  Note that the official Freebase and BaseKB
use different namespace prefixes as well as slightly different predicate names.
The pre-configured ignore files are designed to handle both variants.

The following file controls which predicates are indexed by Lucene:

 - indexed-preds.lst

The predicates in this file can be referenced in queries, all other predicates
are stored, printed and retrievable, but they cannot be directly queried on.
If there are additional predicates you want to use in your Lucene queries, add
them to this list.  If you make changes to this list, you need to rerun the
index creation step.


Indexing Options

The Lucene index supports three basic functions:
(1) storing of all subject keys and their associated predicates and values
(2) lookup of all predicate values for a specific subject based on its key
(3) searching for subjects based on matching of textual information such as
the words in a label, name, alias, description, etc. using the Lucene query
language.

To support the search functionality, predicates with text data must be indexed
appropriately during index creation time.  The toolkit supports three
different text indexing options that can be used individually or in
combination:

(1) per-predicate indexing: in this mode, each indexed predicate that is
text-valued (e.g., rs_label or f_type.object.name) becomes a searchable Lucene
field of the same name.  This mode allows very fine-grained Lucene search
expressions that search for different information in different fields, for
example: 'rs_label:"Barack Obama" AND f_common.topic.description:President'.
This type of indexing is selected with the -ip option during index creation.

(2) text indexing: in this mode, the text values of all indexed, text-valued
predicates of a subject are indexed via a single, combined "text" field (they
are still stored separately for fine-grained lookup via a subject).  This
allows search that spans information from all text values about a subject.
For example: 'text:"President Barack Obama"'.

(3) language indexing: in this mode, text values with language designations
are indexed via separate language-qualified fields.  This will only apply to
languages for which specific analyzers have been configured in config.dat
(more on that below).  For example: 'rs_label@zh:巴拉克·歐巴馬 +rs_label@en:Barack'
This allows for usage of language-appropriate Lucene text analyzers instead of
using a one-size-fits-all approach.  Language indexing applies to both
per-predicate and text indexing, for example: 'text@zh:巴拉克·歐巴馬'.

Per-predicate and text indexing can be used individually or in combination.
Language indexing will apply to one or both of them for the languages
configured in config.dat.  The default setup without any -ip/-it/-il options
uses all three of them together.  They can be down-selected to save some index
space by specifying the desired options individually.
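
The query strings used in the examples above can also be assembled
programmatically.  The following standalone Python 3 sketch builds
field-qualified phrase clauses, including language-qualified fields of the
form predicate@lang.  The helper names are hypothetical and not part of the
toolkit; the only escaping done is for backslash and double quote, which are
the characters that need it inside a quoted phrase in Lucene's classic query
syntax.

```python
def field_name(predicate, lang=None):
    """Return the Lucene field name, e.g. rs_label or rs_label@zh."""
    if lang is None:
        return predicate
    return "%s@%s" % (predicate, lang)


def phrase(predicate, text, lang=None):
    """Build a clause of the form field:"text" with quotes/backslashes escaped."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '%s:"%s"' % (field_name(predicate, lang), escaped)


def all_of(*clauses):
    """Require every clause to match by joining them with AND."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    query = all_of(
        phrase("rs_label", "Barack Obama"),
        phrase("f_common.topic.description", "President"),
        "r_type:f_people.person",  # raw clauses can be passed through as-is
    )
    print(query)
```

The resulting string can be passed as the -q argument of a search command.
Per-predicate clauses only work if the index was built with per-predicate
indexing, and language-qualified clauses only for languages that have an
analyzer configured.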

Lucene uses a standard TF-IDF ranking scheme to select the most relevant matches;
however, there are some normalizations (e.g., for document length) that might
produce counter-intuitive results.  Short fields will be boosted to avoid their
being shadowed by long fields that contain many relevant search terms.  This can
produce ranking issues with "text" fields from subjects that have lots of
information stored about them (such as celebrities).  These can usually be
remedied by adding per-field restrictions (as long as the index was created
with -ip).  The -d debug option provides explanation about how Lucene computed
the rank score for each returned result.


Multi-Lingual Shrinking, Indexing and Search

In the default mono-lingual setup, only English text values are preserved and
all indexed text predicates such as rs_label, f_type.object.name, etc. are
analyzed using LUCENE_INDEX_ANALYZER_DEFAULT, which by default uses Lucene's
StandardAnalyzer.  In a multi-lingual setup, predicate values in multiple
languages of interest are preserved and language-specific Lucene analyzers are
used for those values to enable proper tokenization and search.

For example, to enable a tri-lingual English, Spanish, Chinese configuration,
uncomment the variables in the tri-lingual section of config.dat.  This will
enable the proper ignore files for the shrinking step.  Run it just like above
but with different file names to indicate the tri-lingual setup.  For example
(we only show the BaseKB scenario here, substitute the proper LDC data directory):

  % ./fbt-shrink-freebase.sh -f -o data/basekb-gold-jan-2015.trilingual.shrink.sort.gz /data/LDC2015E42/data/*.nt.gz

Once this has successfully completed, you can build the Lucene index just
like in the mono-lingual setting:

  % java -cp '.:bin:lib/*' edu.isi.kres.FreebaseTools -T data/basekb-gold-jan-2015.trilingual.shrink.sort.gz -I data/basekb-gold-jan-2015.trilingual.shrink.sort.ml.index -c index -f -o -nn -v

The only difference here is the configuration of different analyzers
for one or more languages in config.dat.  For each language of interest,
a specific analyzer is configured via a LUCENE_INDEX_ANALYZER_xx
variable where "xx" is the up-cased code of the particular language.
For example, LUCENE_INDEX_ANALYZER_ZH defines the analyzer to be used
for Chinese text strings.  Analyzers are specified with their full
Java pathnames and have to be visible in the classpath at run time.
By default, language variants (e.g., zh-TW, es-419, etc.) are all
handled by the analyzer configured for the root language.  If necessary,
this can be overridden by defining an analyzer for a language variant
of interest.
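
The fallback from a language variant to its root language can be sketched as
follows in standalone Python 3.  This is an illustration only, not the
toolkit's code.  The configuration is modeled as a plain dictionary, the
analyzer class names are sample values, and the spelling of a
variant-specific variable (the whole tag up-cased, e.g. ZH-TW) is an
assumption, since this document only specifies the form for root languages.

```python
PREFIX = "LUCENE_INDEX_ANALYZER_"


def analyzer_for(lang_tag, config):
    """Pick the analyzer class name for a language tag such as zh-TW.

    A variant-specific entry wins if present; otherwise the entry for the
    root language is used.  None means no analyzer is configured, in which
    case no language-qualified field would be indexed for that language.
    """
    tag = lang_tag.upper()
    root = tag.split("-")[0]
    candidates = [PREFIX + tag]
    if root != tag:
        candidates.append(PREFIX + root)
    for key in candidates:
        if key in config:
            return config[key]
    return None


if __name__ == "__main__":
    # Sample values for illustration; check config.dat for the real settings.
    config = {
        "LUCENE_INDEX_ANALYZER_ZH": "org.apache.lucene.analysis.cjk.CJKAnalyzer",
        "LUCENE_INDEX_ANALYZER_ES": "org.apache.lucene.analysis.es.SpanishAnalyzer",
    }
    for tag in ("zh", "zh-TW", "es-419", "fr"):
        print(tag, analyzer_for(tag, config))
```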

In multi-lingual mode, indexed predicates are indexed separately for each
language of interest.  For example, if rs_label is an indexed predicate,
Lucene fields rs_label@en, rs_label@es and rs_label@zh will be created in the
tri-lingual setup to be able to query values of specific languages.  These
language-qualified predicates are not stored, they are only used for
querying.  For example, the following query can be used to look for Barack
Obama using a Chinese search string (assuming LUCENE_INDEX in config.dat
points to the appropriate tri-lingual index):

  ./fbt-search.sh -s rs_label@zh -q '巴拉克·歐巴馬 +r_type:f_people.person' -v

For this to work, the shell variable LANG must point to an appropriate UTF-8
encoding for your locale.  Use the -v option to make sure Lucene reports the
proper UTF-8 encoded characters in its "Searching for:" string.

You can search across different language fields simultaneously using Lucene's
field-specific query syntax.  For example:

  ./fbt-search.sh -q 'rs_label@zh:巴拉克·歐巴馬 +rs_label@en:Barack +r_type:f_people.person'

The multi-lingual indexing and search strategy implemented here is just one
example of how this might be done.  Instead, one could build different
mono-lingual indexes for each language of interest, or build a tri-lingual
index that uses the same analyzer across fields from different languages, etc.
Which option is best will require some experimentation and depends on the
particular application.


Performance Notes
=================

The FreebaseTools command line utility used in examples above and below
incurs a standard startup overhead of about 0.7 seconds to load the index.  If
you need to run many queries, you should use the toolkit programmatically, so
you only pay for index loading once.  Use the -v option to see times for
index loading, querying and result display.  Using the Python API is a good
and simple way to only pay for index loading once.
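
To compare index loading time against query time across runs, the timing line
printed with -v can be parsed with a few lines of standalone Python 3.  This
sketch assumes the line has the "Run time: setup=...ms, query=...ms,
display=...ms" shape shown in the examples in this document; the function name
is hypothetical.

```python
import re

_RUN_TIME = re.compile(
    r"Run time: setup=(\d+)ms, query=(\d+)ms, display=(\d+)ms")


def parse_run_time(stderr_text):
    """Extract the setup/query/display times (in ms) from -v output.

    Returns a dict with integer values, or None if no timing line is found.
    """
    match = _RUN_TIME.search(stderr_text)
    if match is None:
        return None
    setup, query, display = (int(group) for group in match.groups())
    return {"setup": setup, "query": query, "display": display}
```

Since the timing information goes to stderr, that stream has to be captured
separately from the results on stdout.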

The more predicates of a retrieved subject you display, the more expensive the
operation becomes, since more data needs to be accessed and decompressed.
Therefore, always restrict display to data you actually need (e.g., with the
-p option or by using the appropriate API function).

Queries with very large result sets (e.g., all subjects of type person) are
moderately expensive.  To retrieve the 3M keys of all person entities takes
about 2 minutes on my desktop.

Lucene generally shines with queries that produce small to medium-size result
sets with good ranking discrimination.  If you ask for all "Smith"'s you will
get large result sets (still pretty fast though) that will require significant
filtering to get to what you want.

The smaller the index and the fewer different predicates there are, the better
performance will be.  Therefore, try to ignore as much as possible for your
application during the shrinking phase.

Installing the index files on an SSD drive further improves performance, since
random access read times will be significantly shorter.


Troubleshooting
===============

- make sure things work as expected with the provided pre-built index files
- before you make any modifications, try to run the test and full versions
  of the shrinking and indexing steps using the default configuration -
  preferably with the LDC BaseKB files which are very stable
- if you are having problems with the fbt-shrink-freebase.sh script, try to
  run the pipeline at the end of the script directly in your shell
  (substituting the appropriate abbreviation script and input/output file names).
- if you have problems with Unicode search strings, make sure the LANG
  variable is set to a proper UTF-8 encoding (e.g., en_US.UTF-8 for a US locale)
- use the -v and -d options to get more information about configuration options
  and search strings used


Questions, Suggestions and Comments
===================================

Send mail to Hans Chalupsky <hans at isi dot edu>


Example Lookup Queries
======================

In the examples below we use the fbt-lookup.sh script which is preconfigured
to use data/basekb-gold-jan-2015.shrink.sort.index as the index.  If you
generated an index with a different name, you have to edit the config.dat
file first to point the LUCENE_INDEX variable to the right index directory.

Lookup queries are the most basic of operations where we look up information
about a subject based on its key (or MID).  For example:

  % cd $FBT_HOME
  % ./fbt-lookup.sh -q f_m.0h54qv8
  f_m.0h54qv8:
      f_common.topic.article: f_m.0h54qvd
      f_common.topic.description: "Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins. He was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists' Club from 1861 to 1881. He especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum's collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en
      f_common.topic.notable_for: f_g.125crzjzl
      f_common.topic.notable_types: f_m.022tfrk
      f_common.topic.topic_equivalent_webpage: we_Henry_Higgins_(botanist)
      f_common.topic.topic_equivalent_webpage: we_index.html?curid=32997517
      f_people.deceased_person.date_of_death: "1893"^^<http://www.w3.org/2001/XMLSchema#gYear>
      f_people.person.date_of_birth: "1814"^^<http://www.w3.org/2001/XMLSchema#gYear>
      f_people.person.gender: f_m.05zppz
      f_people.person.profession: f_m.036n1
      f_type.object.name: "Henry Higgins"@en
      fk_key.wikipedia.en: "Henry_Higgins_$0028botanist$0029"
      fk_key.wikipedia.en: "Henry_Hugh_Higgins"
      fk_key.wikipedia.en_id: "32997517"
      fk_key.wikipedia.en_title: "Henry_Higgins_$0028botanist$0029"
      r_type: f_common.topic
      r_type: f_people.deceased_person
      r_type: f_people.person
      rs_label: "Henry Higgins"@en
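
The values in this output come in three shapes: references to other subjects
or URIs (e.g., f_m.05zppz), language-tagged strings ("..."@en) and typed
literals ("..."^^<datatype>).  A minimal standalone Python 3 sketch for
telling them apart when post-processing results is shown below.  It is based
only on the value formats visible in this example, and the function name and
returned dictionary layout are hypothetical.

```python
import re

# "lexical form" optionally followed by @lang or ^^<datatype-URI>
_LITERAL = re.compile(
    r'^"(.*)"(?:@([A-Za-z0-9-]+)|\^\^<([^>]+)>)?$', re.DOTALL)


def parse_value(value):
    """Split a displayed value into its parts.

    Quoted values are returned as literals with optional language tag or
    datatype; everything else is treated as a reference (key or URI).
    """
    match = _LITERAL.match(value)
    if match is None:
        return {"kind": "reference", "value": value}
    return {
        "kind": "literal",
        "value": match.group(1),
        "lang": match.group(2),
        "datatype": match.group(3),
    }
```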


The -v option adds some additional progress and timing information.  These are
printed to stderr while results are printed to stdout, so they can be
redirected to a file without being polluted by progress information.

  % ./fbt-lookup.sh -q f_m.0h54qv8 -v
  Loading index...
  f_m.0h54qv8:
      f_common.topic.article: f_m.0h54qvd
      f_common.topic.description: "Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins. He was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists' Club from 1861 to 1881. He especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum's collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en
      f_common.topic.notable_for: f_g.125crzjzl
      f_common.topic.notable_types: f_m.022tfrk
      f_common.topic.topic_equivalent_webpage: we_Henry_Higgins_(botanist)
      f_common.topic.topic_equivalent_webpage: we_index.html?curid=32997517
      f_people.deceased_person.date_of_death: "1893"^^<http://www.w3.org/2001/XMLSchema#gYear>
      f_people.person.date_of_birth: "1814"^^<http://www.w3.org/2001/XMLSchema#gYear>
      f_people.person.gender: f_m.05zppz
      f_people.person.profession: f_m.036n1
      f_type.object.name: "Henry Higgins"@en
      fk_key.wikipedia.en: "Henry_Higgins_$0028botanist$0029"
      fk_key.wikipedia.en: "Henry_Hugh_Higgins"
      fk_key.wikipedia.en_id: "32997517"
      fk_key.wikipedia.en_title: "Henry_Higgins_$0028botanist$0029"
      r_type: f_common.topic
      r_type: f_people.deceased_person
      r_type: f_people.person
      rs_label: "Henry Higgins"@en
  
  Run time: setup=740ms, query=36ms, display=0ms


Instead of displaying all the information about a subject, we can restrict
output to specific predicates.  For example:

  % ./fbt-lookup.sh -q f_m.0h54qv8 -p 'f_type.object.name, f_common.topic.description, r_type'
  f_m.0h54qv8	f_type.object.name	"Henry Higgins"@en
  f_m.0h54qv8	f_common.topic.description	"Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins. He was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists' Club from 1861 to 1881. He especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum's collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en
  f_m.0h54qv8	r_type	f_common.topic
  f_m.0h54qv8	r_type	f_people.deceased_person
  f_m.0h54qv8	r_type	f_people.person


We can use predicate chains to access information that is one or more links
away from a subject.  For example, in the query below we look up the labels of
the objects defining the person's gender and professions:

  % ./fbt-lookup.sh -q f_m.0h54qv8 -p 'f_type.object.name, f_common.topic.description, f_people.person.gender>rs_label, f_people.person.profession>rs_label, r_type'
  f_m.0h54qv8	f_type.object.name	"Henry Higgins"@en
  f_m.0h54qv8	f_common.topic.description	"Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins. He was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists' Club from 1861 to 1881. He especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum's collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en
  f_m.0h54qv8	f_people.person.gender	f_m.05zppz	rs_label	"Male"@en
  f_m.0h54qv8	f_people.person.profession	f_m.036n1	rs_label	"Geologist"@en
  f_m.0h54qv8	f_people.person.profession	f_m.036n1	rs_label	"Geologist"@en-GB
  f_m.0h54qv8	r_type	f_common.topic
  f_m.0h54qv8	r_type	f_people.deceased_person
  f_m.0h54qv8	r_type	f_people.person

Predicate chains trigger individual term lookup queries for the objects that
are followed.  This is ok for small result sets but will become expensive for
larger sets.  A database or triple store would run a more efficient join
operation for those, but that is one of the operations not available to us
with Lucene.  Note that if a predicate or chain is not defined for a
particular subject, a null value will be displayed.
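
The tab-separated output produced with -p is easy to post-process.  The
standalone Python 3 sketch below groups it by subject and predicate chain.  It
assumes the layout shown in the examples above (subject, then alternating
predicates and intermediate objects, then the final value, all separated by
TABs); how null values are rendered is not covered, and the function names are
hypothetical.

```python
from collections import defaultdict


def parse_lookup_line(line):
    """Split one output line into (subject, predicates, hops, value).

    For a plain predicate the line has three fields.  For a chain such as
    a>b it has five: subject, a, intermediate object, b, value.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 TAB-separated fields: %r" % line)
    subject, value = fields[0], fields[-1]
    middle = fields[1:-1]
    predicates = tuple(middle[0::2])  # a, b, ...
    hops = tuple(middle[1::2])        # intermediate objects followed
    return subject, predicates, hops, value


def group_lookup_output(lines):
    """Collect values as result[subject]["a>b"] -> list of values."""
    result = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip():
            continue
        subject, predicates, _hops, value = parse_lookup_line(line)
        result[subject][">".join(predicates)].append(value)
    return result
```

Keying on the chain string (e.g., f_people.person.gender>rs_label) mirrors the
way the chain was written in the -p option, so requested and returned names
line up.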


We can look up more than one object at the same time, which amortizes the index
loading time.  For example:

  % ./fbt-lookup.sh -q f_m.0h54qv8,f_m.03wwvwm -p 'f_type.object.name,r_type'
  f_m.0h54qv8	f_type.object.name	"Henry Higgins"@en
  f_m.0h54qv8	r_type	f_common.topic
  f_m.0h54qv8	r_type	f_people.deceased_person
  f_m.0h54qv8	r_type	f_people.person
  f_m.03wwvwm	f_type.object.name	"John R Owens"@en
  f_m.03wwvwm	r_type	f_common.topic
  f_m.03wwvwm	r_type	f_people.person


In Unix we can also pipe in subject keys from stdin, one per line, by giving
`-' as the query string.  The keys might have been generated by a different
query or some other operation.  For example:

  % printf 'f_m.0h54qv8\nf_m.03wwvwm\n' | ./fbt-lookup.sh -q - -p 'f_type.object.name,r_type'
  f_m.0h54qv8	f_type.object.name	"Henry Higgins"@en
  f_m.0h54qv8	r_type	f_common.topic
  f_m.0h54qv8	r_type	f_people.deceased_person
  f_m.0h54qv8	r_type	f_people.person
  f_m.03wwvwm	f_type.object.name	"John R Owens"@en
  f_m.03wwvwm	r_type	f_common.topic
  f_m.03wwvwm	r_type	f_people.person
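
When such output is consumed by another program, it is often convenient to
group it by subject.  The following Python sketch assumes that the columns are
tab-separated and that each line has exactly three columns (subject,
predicate, value); rows produced by predicate chains have more columns and are
skipped here.  group_lookup_output is a hypothetical helper and not part of
FreebaseTools:

```python
from collections import OrderedDict


def group_lookup_output(lines):
    """Group 'subject<TAB>predicate<TAB>value' lines by subject and predicate."""
    grouped = OrderedDict()
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) != 3:
            continue  # skip blank lines and predicate-chain rows
        subject, predicate, value = fields
        by_pred = grouped.setdefault(subject, OrderedDict())
        by_pred.setdefault(predicate, []).append(value)
    return grouped


# Sample lines copied from the lookup output shown in this section.
sample = [
    '  f_m.0h54qv8\tf_type.object.name\t"Henry Higgins"@en',
    '  f_m.0h54qv8\tr_type\tf_common.topic',
    '  f_m.0h54qv8\tr_type\tf_people.deceased_person',
    '  f_m.0h54qv8\tr_type\tf_people.person',
    '  f_m.03wwvwm\tf_type.object.name\t"John R Owens"@en',
    '  f_m.03wwvwm\tr_type\tf_common.topic',
    '  f_m.03wwvwm\tr_type\tf_people.person',
]

for subject, preds in group_lookup_output(sample).items():
    print(subject + ': ' + ', '.join(preds['r_type']))
```

In practice the lines would come from the captured stdout of fbt-lookup.sh
rather than from a hard-coded list.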


Example Search Queries
======================

In the examples below we use the fbt-search.sh script which is preconfigured
to use data/basekb-gold-jan-2015.shrink.sort.index as the index.  If you
generated an index with a different name, you have to edit the config.dat
file first to point the LUCENE_INDEX variable to the right index directory.

Search queries can be used to find subjects in the knowledge graph by matching
text strings such as names, labels and descriptions.  They use the Lucene
query syntax to search the index (see
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html for more
information on Lucene queries).  For example:

  % cd $FBT_HOME
  % ./fbt-search.sh -q 'Claude AND Parsons AND r_type:f_people.person' -v
  Loading index...
  Index contains 107692853 documents
  Searching for: +claude +parsons +r_type:f_people.person
  Found 1 matching subject(s)
  Printing results...
  f_m.02rk97l: [score=9.111403]
      f_common.topic.alias: "Claude Parsons"@en
      f_common.topic.article: f_m.02rk97p
      f_common.topic.description: "Claude VanCleve Parsons was a U.S. Representative from Illinois. Born on a farm near McCormick, Pope County, Illinois, Parsons attended the public schools. He taught in the rural schools of Pope County, Illinois from 1914 to 1922. He was graduated from Southern Illinois State Normal School at Carbondale in 1923. He moved to Golconda, Illinois, in 1922 to become county superintendent of schools, in which capacity he served until 1930. He was also engaged as an editor and newspaper publisher from 1924 to 1930. Parsons was elected on November 4, 1930, as a Democrat to the Seventy-first Congress to fill the vacancy caused by the resignation of Thomas S. Williams and on the same day was elected to the Seventy-second Congress. He was reelected to the Seventy-third and to the three succeeding Congresses and served from November 4, 1930, to January 3, 1941. He was an unsuccessful candidate for reelection in 1940 to the Seventy-seventh Congress. He was appointed first assistant administrator of the United States Housing Authority February 14, 1941, and served until his death in Washington, D.C., May 23, 1941. He was interred in Zion Church Cemetery, near Ozark, Illinois."@en
      f_common.topic.notable_for: f_g.125cz1ny1
      f_common.topic.notable_types: f_m.05kpwk1
      f_common.topic.topic_equivalent_webpage: http://bioguide.congress.gov/scripts/biodisplay.pl?index=P000086
      f_common.topic.topic_equivalent_webpage: we_Claude_V._Parsons
      f_common.topic.topic_equivalent_webpage: we_index.html?curid=11586221
      f_government.u_s_congressperson.thomas_id: "P000086"
      f_people.deceased_person.date_of_death: "1941-05-23"^^<http://www.w3.org/2001/XMLSchema#date>
      f_people.person.date_of_birth: "1895-10-07"^^<http://www.w3.org/2001/XMLSchema#date>
      f_people.person.gender: f_m.05zppz
      f_type.object.name: "Claude V. Parsons"@en
      fk_key.base.uspolitician.thomas_id: "P000086"
      fk_key.en: "claude_v_parsons"
      fk_key.wikipedia.en: "Claude_Parsons"
      fk_key.wikipedia.en: "Claude_V$002E_Parsons"
      fk_key.wikipedia.en_id: "11586221"
      fk_key.wikipedia.en_title: "Claude_V$002E_Parsons"
      r_type: f_common.topic
      r_type: f_government.politician
      r_type: f_government.u_s_congressperson
      r_type: f_people.deceased_person
      r_type: f_people.person
      rs_label: "Claude V. Parsons"@en
  
  Run time: setup=764ms, query=144ms, display=9ms

In this query we looked for objects with name "Claude Parsons" that were also
of type person.  All query terms in a Lucene query are matched against a
field.  If no field is specified for a term, a default search field is used.
In our implementation, the default field is `rs_label', which can be changed
in config.dat or via the -s command-line option, which overrides what's in the
configuration file.

To restrict a field other than the default field, the syntax `field:value'
can be used, which we did for the type restriction.  Remember that in our
index implementation, each predicate becomes a field of the same name in the
Lucene document describing a particular subject key.  All predicates about a
subject are stored, so we can see their values when a subject is accessed, but
only a small number are indexed (those listed in indexed-preds.lst).  `r_type'
is one of the indexed predicates, which is why we can use it in the query;
`f_people.person.gender' is not.  So adding a clause restricting the gender
makes the query return no matches:

  % ./fbt-search.sh -q 'Claude AND Parsons AND r_type:f_people.person AND f_people.person.gender:f_m.05zppz' -v
  Loading index...
  Index contains 107692853 documents
  Searching for: +claude +parsons +r_type:f_people.person +(f_people.person.gender:f_m f_people.person.gender:05zppz)
  Found 0 matching subject(s)
  Printing results...
  Run time: setup=806ms, query=131ms, display=0ms

Lucene didn't know how to handle the f_people.person.gender field, so it
tokenized the value as text (splitting it into `f_m' and `05zppz'), and since
the field is not indexed, nothing matched.  If we want to restrict queries
based on people's gender, we can simply add f_people.person.gender to
indexed-preds.lst and rebuild the index.
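
A quick pre-flight check can catch this kind of mistake before a query is run.
The Python sketch below extracts the field names used in a query string and
reports those that are not in a given set of indexed predicates.  The helper
name unindexed_fields and the two-element set are assumptions for
illustration; in a real setup the set would be read from indexed-preds.lst,
whose exact file format is not described here.  The regular expression is
deliberately simple and would also pick up colons inside quoted phrases or
URLs:

```python
import re

# A field name is a run of word characters and dots directly followed by ':'.
FIELD_RE = re.compile(r'(?<![\w.])([A-Za-z_][\w.]*):')


def unindexed_fields(query, indexed_preds):
    """Return fields used as field:value in query that are not indexed."""
    missing = []
    for field in FIELD_RE.findall(query):
        if field not in indexed_preds and field not in missing:
            missing.append(field)
    return missing


# Assumed subset of the indexed predicates, for illustration only.
indexed = {'rs_label', 'r_type'}
query = ('Claude AND Parsons AND r_type:f_people.person '
         'AND f_people.person.gender:f_m.05zppz')
print(unindexed_fields(query, indexed))  # prints ['f_people.person.gender']
```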


In the query below we use a more complex construct to look for matches in both
the `rs_label' default field and the `f_common.topic.description' field.
This time we get additional matches that mention both Claude and Parsons in
their description string; however, their scores are significantly lower than
that of the first (correct) match.  Note that we could also have used Lucene's
proximity search to look for matches to Claude and Parsons within a maximum
distance of each other.  Also note that, similar to lookup queries, we can use
-p to restrict what is printed for our results:

  % ./fbt-search.sh -q '((Claude AND Parsons) OR (f_common.topic.description:Claude AND f_common.topic.description:Parsons)) AND r_type:f_people.person' -p 'f_type.object.name, f_common.topic.description' -v
  Loading index...
  Index contains 107692853 documents
  Searching for: +((+claude +parsons) (+f_common.topic.description:claude +f_common.topic.description:parsons)) +r_type:f_people.person
  Found 5 matching subject(s)
  Printing results...
  f_m.02rk97l	7.690588	f_type.object.name	"Claude V. Parsons"@en
  f_m.02rk97l	7.690588	f_common.topic.description	"Claude VanCleve Parsons was a U.S. Representative from Illinois. Born on a farm near McCormick, Pope County, Illinois, Parsons attended the public schools. He taught in the rural schools of Pope County, Illinois from 1914 to 1922. He was graduated from Southern Illinois State Normal School at Carbondale in 1923. He moved to Golconda, Illinois, in 1922 to become county superintendent of schools, in which capacity he served until 1930. He was also engaged as an editor and newspaper publisher from 1924 to 1930. Parsons was elected on November 4, 1930, as a Democrat to the Seventy-first Congress to fill the vacancy caused by the resignation of Thomas S. Williams and on the same day was elected to the Seventy-second Congress. He was reelected to the Seventy-third and to the three succeeding Congresses and served from November 4, 1930, to January 3, 1941. He was an unsuccessful candidate for reelection in 1940 to the Seventy-seventh Congress. He was appointed first assistant administrator of the United States Housing Authority February 14, 1941, and served until his death in Washington, D.C., May 23, 1941. He was interred in Zion Church Cemetery, near Ozark, Illinois."@en
  f_m.06x5hk	2.0857942	f_type.object.name	"T. Claude Ryan"@en
  f_m.06x5hk	2.0857942	f_common.topic.description	"Tubal Claude Ryan was an Irish-American aviator born in Parsons, Kansas. Ryan was best known for founding several airlines and aviation factories."@en
  f_m.025xzn6	1.4879408	f_type.object.name	"Curly Putman"@en
  f_m.025xzn6	1.4879408	f_common.topic.description	"Claude "Curly" Putman, Jr. is an American songwriter, based in Nashville. His biggest success was "Green, Green Grass of Home", which was covered by Roger Miller, Elvis Presley, Kenny Rogers, Johnny Darrell, Gram Parsons, Joan Baez, Jerry Lee Lewis, The Grateful Dead, Johnny Cash, Roberto Leal, Merle Haggard, Bobby Bare, Joe Tex, Nana Mouskouri, and Tom Jones. The Paul McCartney & Wings hit "Junior's Farm" was inspired by their short stay at Putman's farm in rural Wilson County, Tennessee in 1974."@en
  f_m.015wwg	1.402533	f_type.object.name	"James Burton"@en
  f_m.015wwg	1.402533	f_common.topic.description	"James Burton is a film actor."@en
  f_m.015wwg	1.402533	f_common.topic.description	"James Burton is an American guitarist. A member of the Rock and Roll Hall of Fame since 2001, Burton has also been recognized by the Rockabilly Hall of Fame and the Musicians Hall of Fame and Museum. Critic Mark Demming writes that "Burton has a well-deserved reputation as one of the finest guitar pickers in either country or rock ... Burton is one of the best guitar players to ever touch a fretboard." James Burton is also known as the "Master of the Telecaster." Since the 1950s, Burton has recorded and performed with an array of notable singers, including Bob Luman, Dale Hawkins, Ricky Nelson, Elvis Presley, Johnny Cash, Merle Haggard, Glen Campbell, John Denver, Gram Parsons, Emmylou Harris, Judy Collins, Jerry Lee Lewis, Claude King, Elvis Costello, Joe Osborn, Roy Orbison, Joni Mitchell, Vince Gill, Suzi Quatro and Allen "Puddler" Harris."@en
  f_m.01qf23	1.2317178	f_type.object.name	"Susan McMaster"@en
  f_m.01qf23	1.2317178	f_common.topic.description	"Susan McMaster is a Canadian poet, literary editor, spoken word/performance poet, and 2011-12 President of the League of Canadian Poets. She lives in Ottawa, Ontario. Her recent poetry books are Paper Affair: Poems Selected and New, Pith & Wry: Canadian Poetry, and Crossing Arcs: Alzheimer's, My Mother, and Me, which was a finalist for the 2010 Acorn-Plantos People's Poetry Prize, the 2010 Ottawa Book Awards, and the 2010 Archibald Lampman Poetry Prize. She is the author of several wordmusic collections, performance poetry recordings, and scripts; has edited poetry anthologies and series; and was the founding editor of the national feminist and art magazine Branching Out. McMaster was an original member of the intermedia group First Draft, with members including Andrew McClure, Colin Morton, Alrick Huebener, Roberta Huebener, Claude Dupuis, Peter Thomas, and David Parsons. Together, they recorded, published, and performed some 40 times across Canada in the 1980s. Since 1996, she has been the wordsmith in Geode Music & Poetry, making four spoken word and music recordings with Jennifer Giles on keyboards, Alrick Huebener on bass, Gavin McLintock on sax, and friends, including Dave Broscoe, Jamie Gullikson, Mike Essoudry, Petr Cancura, Mark Molnar, John Higney, Linsey Wellman, Penn Kemp, Colin Morton, and Max Middle. She has performed and recorded with SugarBeat and Geode at 50-plus venues, including the Banff Centre, the National Library, the Kingston Fringe Jazz Festival, Rasputin's, the Blue Skies Music Festival, the Ottawa Folk Festival, the Elora Music Festival, Artscape, WordBeat, Morningside, Go, the National Arts Center Fourth Stage, and the Ottawa International Writers Festival, and has read and performed at festivals and venues in France and Italy."@en
  Run time: setup=779ms, query=287ms, display=12ms
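
Queries like the one above get tedious to type by hand once more terms or
fields are involved.  The following Python sketch assembles the same query
string from its parts and also shows the standard Lucene proximity form
("term1 term2"~N).  The helper names are hypothetical and not part of
FreebaseTools; the resulting strings are meant to be passed to fbt-search.sh
via -q:

```python
def all_terms(terms, field=None):
    """Build '(t1 AND t2 ...)', optionally prefixing each term with 'field:'."""
    prefix = field + ':' if field else ''
    return '(' + ' AND '.join(prefix + term for term in terms) + ')'


def multi_field_query(terms, extra_fields, restrictions):
    """Match all terms in the default field OR in any extra field, AND
    apply hard field:value restrictions."""
    alternatives = [all_terms(terms)]
    alternatives += [all_terms(terms, field) for field in extra_fields]
    clauses = ['(' + ' OR '.join(alternatives) + ')']
    clauses += [field + ':' + value for field, value in restrictions]
    return ' AND '.join(clauses)


def proximity_query(terms, distance, field=None):
    """Build a Lucene proximity query: all terms within `distance` words."""
    prefix = field + ':' if field else ''
    return '%s"%s"~%d' % (prefix, ' '.join(terms), distance)


print(multi_field_query(['Claude', 'Parsons'],
                        ['f_common.topic.description'],
                        [('r_type', 'f_people.person')]))
print(proximity_query(['Claude', 'Parsons'], 5,
                      field='f_common.topic.description'))
```

The first call reproduces the query string used in the example above.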


The following query uses predicate chains in the result printing to describe
the family relationships of persons named Barack Obama (there are two in the
KB, father and son):

  % ./fbt-search.sh -q '+Barack +Obama +r_type:f_people.person' -p 'f_type.object.name, f_common.topic.alias, f_people.person.sibling_s>f_people.sibling_relationship.sibling>rs_label, f_people.person.spouse_s>f_people.marriage.spouse>rs_label, f_people.person.children>rs_label' -v
  Loading index...
  Index contains 107692853 documents
  Searching for: +barack +obama +r_type:f_people.person
  Found 2 matching subject(s)
  Printing results...
  f_m.02mjmr	10.13228	f_type.object.name	"Barack Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Bama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barack H. Obama II"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barack Hussein Obama II"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barack Hussein Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barack Hussein Obama, Jr."@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barack Obama II"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barak Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Barry"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"No Drama Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Obomber"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"President Barack H. Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"President Barack Hussein Obama II"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"President Barack Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"President Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Rock"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Sen Barack Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Sen. Barack Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"Senator Barack Obama"@en
  f_m.02mjmr	10.13228	f_common.topic.alias	"The One"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.044_q0m	f_people.sibling_relationship.sibling	f_m.03w9f63	rs_label	"Maya Soetoro-Ng"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0kv53ff	f_people.sibling_relationship.sibling	f_m.0kv53fg	rs_label	"George Obama"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0kv53fr	f_people.sibling_relationship.sibling	f_m.04vi1tj	rs_label	"Mark Okoth Obama Ndesandjo"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0n4sqh8	f_people.sibling_relationship.sibling	f_m.04ct7vv	rs_label	"Malik Abongo Obama"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0n4sqk2	f_people.sibling_relationship.sibling	f_m.0n4sqk3	rs_label	"Bernard Obama"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0n4sqkd	f_people.sibling_relationship.sibling	f_m.04vy21h	rs_label	"David Ndesandjo"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0n4sqkk	f_people.sibling_relationship.sibling	f_m.0n4sqkl	rs_label	"Abo Obama"@en
  f_m.02mjmr	10.13228	f_people.person.sibling_s	f_m.0n4sql0	f_people.sibling_relationship.sibling	f_m.0h4np5z	rs_label	"Auma Obama"@en
  f_m.02mjmr	10.13228	f_people.person.spouse_s	f_m.02nqglv	f_people.marriage.spouse	f_m.025s5v9	rs_label	"Michelle Obama"@en
  f_m.02mjmr	10.13228	f_people.person.children	f_m.02nqgyw	rs_label	"Natasha Obama"@en
  f_m.02mjmr	10.13228	f_people.person.children	f_m.0gh6dh1	rs_label	"Malia Ann Obama"@en
  f_m.03qccxj	10.13228	f_type.object.name	"Barack Obama Sr."@en
  f_m.03qccxj	10.13228	f_common.topic.alias	"Barack Hussein Obama"@en
  f_m.03qccxj	10.13228	f_common.topic.alias	"Barack Obama, Sr."@en
  f_m.03qccxj	10.13228	f_common.topic.alias	"Baraka Obama"@en
  f_m.03qccxj	10.13228	f_common.topic.alias	"barack_obama_sr"@en
  f_m.03qccxj	10.13228	f_people.person.sibling_s	f_m.0k6m6kc	f_people.sibling_relationship.sibling	f_m.04y93y0	rs_label	"Zeituni Onyango"@en
  f_m.03qccxj	10.13228	f_people.person.sibling_s	f_m.0wzjdyf	f_people.sibling_relationship.sibling	f_m.0wzj04v	rs_label	"Sarah Obama"@en
  f_m.03qccxj	10.13228	f_people.person.spouse_s	f_m.040fvqp	f_people.marriage.spouse	f_m.03hfxq_	rs_label	"Ann Dunham"@en
  f_m.03qccxj	10.13228	f_people.person.spouse_s	f_m.0j4l4y8	f_people.marriage.spouse	f_m.040fvrd	rs_label	"Ruth Nidesand"@en
  f_m.03qccxj	10.13228	f_people.person.spouse_s	f_m.0j4l4yg	f_people.marriage.spouse	f_m.040fvr1	rs_label	"Kezia Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.02mjmr	rs_label	"Barack Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.04ct7vv	rs_label	"Malik Abongo Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.04vy1tj	rs_label	"Mark Okoth Obama Ndesandjo"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.04vy21h	rs_label	"David Ndesandjo"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.0h4np5z	rs_label	"Auma Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.0kv53fg	rs_label	"George Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.0n4sqk3	rs_label	"Bernard Obama"@en
  f_m.03qccxj	10.13228	f_people.person.children	f_m.0n4sqkl	rs_label	"Abo Obama"@en
  Run time: setup=787ms, query=147ms, display=59ms


By default, each search query retrieves at most 10 results.  This limit can be
raised or lowered with the -m parameter.  For example, to retrieve 5
arbitrary person keys, we can run the following:

  % ./fbt-search.sh -q r_type:f_people.person -p subject -m 5 -v
  Loading index...
  Index contains 107692853 documents
  Searching for: r_type:f_people.person
  Found 3073576 matching subject(s)
  Printing results...
  f_m.03cdrkk	4.5564413
  f_m.03cdrkx	4.5564413
  f_m.03cdrnp	4.5564413
  f_m.03cdrp0	4.5564413
  f_m.03cdrpc	4.5564413
  Run time: setup=1161ms, query=153ms, display=12ms


We can get all results by supplying -1 for -m.  For example, to retrieve all
3073576 person keys we can do this (which takes about 1.5 minutes):

  % ./fbt-search.sh -q r_type:f_people.person -p subject -m -1 -v > /tmp/persons.lst
  Loading index...
  Index contains 107692853 documents
  Searching for: r_type:f_people.person
  Found 3073576 matching subject(s)
  Collecting results...
  Printing results...
  Run time: setup=771ms, query=2883ms, display=87807ms

  % wc -l /tmp/persons.lst
  3073576 /tmp/persons.lst
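
A key file like this can then drive follow-up lookups.  The Python sketch
below reads the keys and groups them into comma-separated batches suitable for
the -q option of fbt-lookup.sh, so that the index loading time is amortized
over many keys.  It assumes that each line of the file has the same shape as
the `-p subject' output shown earlier (key, tab, score); read_keys and batches
are hypothetical helpers, and the batch size is arbitrary:

```python
def read_keys(lines):
    """Extract the subject key (first tab-separated column) from each line."""
    keys = []
    for line in lines:
        key = line.split('\t')[0].strip()
        if key:
            keys.append(key)
    return keys


def batches(keys, size):
    """Yield comma-separated key strings with at most `size` keys each."""
    for start in range(0, len(keys), size):
        yield ','.join(keys[start:start + size])


# Sample lines in the shape of the `-p subject' output shown earlier.
sample = [
    'f_m.03cdrkk\t4.5564413\n',
    'f_m.03cdrkx\t4.5564413\n',
    'f_m.03cdrnp\t4.5564413\n',
]

for batch in batches(read_keys(sample), 2):
    print(batch)
```

For a real file, pass an open file object instead of the sample list, for
example read_keys(open('/tmp/persons.lst')).  For very large key sets, piping
the keys to `-q -' as shown in the lookup examples avoids command-line length
limits.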


Installing and Using the Python Interface
=========================================

Ensure FreebaseTools (FBT) is properly configured and works from the command
line by calling 'fbt-lookup.sh' or 'fbt-search.sh' using some of the examples
above.

The fbtools.py package was developed and tested with Python 2.7 only; with
Python 3 your mileage may vary.  It shouldn't be too hard to make it work with
Python 3, but that work remains to be done.

The fbtools.py package calls the Java version of FBT via the jnius package, so
ensure you have that available on your system or install it via 'sudo pip
install jnius' or similar.  Make sure you install a version of jnius that works
with Python 2.7.

Point your PYTHONPATH to the $FBT_HOME directory, for example:

  % setenv PYTHONPATH ${FBT_HOME}:${PYTHONPATH}

Either permanently edit the 'freebaseToolsHome' and 'freebaseToolsConfig'
variables in fbtools.py or use fbt.configure immediately after import to
configure the package dynamically as shown in the example dialog below.


Usage Example

>>> import fbtools as fbt

# point to your own $FBT_HOME directory (and config file if it differs from the default):
>>> fbt.configure(home='/home/hans/projects/nlp/code/freebase', config='config.dat.dist')

# ignore the warning:
>>> fbi = fbt.FreebaseIndex()
WARN: problem reading config file `/home/hans/projects/nlp/code/freebase/config.dat.dist': null

>>> fbi.describe()
Number of indexed documents: 107692853
Configuration:
  SORT_DIR /home/hans/projects/nlp/code/freebase/sort
  IGNORE_LANGS /home/hans/projects/nlp/code/freebase/ignore-langs.lst
  IGNORE_PREDS /home/hans/projects/nlp/code/freebase/ignore-preds.lst
  IGNORE_VALUES /home/hans/projects/nlp/code/freebase/ignore-values.lst
  LUCENE_DEFAULT_FIELD rs_label
  LUCENE_INDEX /home/hans/projects/nlp/code/freebase/data/basekb-gold-jan-2015.shrink.sort.index
  LUCENE_INDEXED_PREDS /home/hans/projects/nlp/code/freebase/indexed-preds.lst
  LUCENE_INDEX_ANALYZER_DEFAULT org.apache.lucene.analysis.standard.StandardAnalyzer
  LUCENE_INDEX_OPTIONS "-nn -v"

>>> fbi.lookup('f_m.0h54qv8')
<org.apache.lucene.document.Document at 0x7f87e8d74a70 jclass=org/apache/lucene/document/Document jself=<LocalRef obj=0x1c95cc8 at 0x7f87ea42c350>>

>>> fbi.getFieldValue('f_m.0h54qv8', 'rs_label')
'"Henry Higgins"@en'

>>> fbi.getFieldValues('f_m.0h54qv8', 'r_type')
['f_common.topic', 'f_people.deceased_person', 'f_people.person']

>>> fbi.getFieldValues('/m/0h54qv8', 'r_type')
['f_common.topic', 'f_people.deceased_person', 'f_people.person']

>>> fbi.getDocumentId('f_m.0h54qv8')
69442668

>>> fbi.getDocument(69442668)
<org.apache.lucene.document.Document at 0x7f87e8d74a70 jclass=org/apache/lucene/document/Document jself=<LocalRef obj=0x1c95cb8 at 0x7f87ea42c370>>

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)

>>> pp.pprint(fbi.fetch('f_m.0h54qv8'))
{   'f_common.topic.article': 'f_m.0h54qvd',
    'f_common.topic.description': '"Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins.\\nHe was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists\' Club from 1861 to 1881.\\nHe especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum\'s collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en',
    'f_common.topic.notable_for': 'f_g.125crzjzl',
    'f_common.topic.notable_types': 'f_m.022tfrk',
    'f_common.topic.topic_equivalent_webpage': [   'we_Henry_Higgins_(botanist)',
                                                   'we_index.html?curid=32997517'],
    'f_people.deceased_person.date_of_death': '"1893"^^<http://www.w3.org/2001/XMLSchema#gYear>',
    'f_people.person.date_of_birth': '"1814"^^<http://www.w3.org/2001/XMLSchema#gYear>',
    'f_people.person.gender': 'f_m.05zppz',
    'f_people.person.profession': 'f_m.036n1',
    'f_type.object.name': '"Henry Higgins"@en',
    'fk_key.wikipedia.en': [   '"Henry_Higgins_$0028botanist$0029"',
                               '"Henry_Hugh_Higgins"'],
    'fk_key.wikipedia.en_id': '"32997517"',
    'fk_key.wikipedia.en_title': '"Henry_Higgins_$0028botanist$0029"',
    'r_type': [   'f_common.topic',
                  'f_people.deceased_person',
                  'f_people.person'],
    'rs_label': '"Henry Higgins"@en',
    'subject': 'f_m.0h54qv8'}
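
As the output above shows, literal values come back in their RDF form,
including the surrounding quotes and any language tag or datatype suffix.  A
small helper can split them apart.  This is a sketch under the assumption that
literals always look like "text", "text"@lang or "text"^^<datatype>, as they
do in the examples in this document; parse_literal is a hypothetical name and
not part of fbtools:

```python
import re

# "text" optionally followed by @lang or ^^<datatype-uri>.
LITERAL_RE = re.compile(
    r'^"(.*)"(?:@([A-Za-z0-9-]+)|\^\^<([^>]+)>)?$', re.DOTALL)


def parse_literal(value):
    """Return (text, language, datatype) for a literal, or None for
    anything else (for example a subject key such as f_m.05zppz)."""
    match = LITERAL_RE.match(value)
    if match is None:
        return None
    return match.groups()


print(parse_literal('"Henry Higgins"@en'))
print(parse_literal('"1893"^^<http://www.w3.org/2001/XMLSchema#gYear>'))
print(parse_literal('f_m.05zppz'))
```

Values that fetch returns as lists (such as r_type above) need to be handled
element by element.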

>>> fbi.search('Henry Higgins AND r_type:f_people.person', maxHits=5)
[(69442668, 8.640470504760742), (93746332, 8.640470504760742), (3290194, 4.759587287902832), (15812, 4.0124993324279785), (63493, 4.0124993324279785)]

>>> pp.pprint(fbi.fetch(69442668))
{   'f_common.topic.article': 'f_m.0h54qvd',
    'f_common.topic.description': '"Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins.\\nHe was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists\' Club from 1861 to 1881.\\nHe especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum\'s collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en',
    'f_common.topic.notable_for': 'f_g.125crzjzl',
    'f_common.topic.notable_types': 'f_m.022tfrk',
    'f_common.topic.topic_equivalent_webpage': [   'we_Henry_Higgins_(botanist)',
                                                   'we_index.html?curid=32997517'],
    'f_people.deceased_person.date_of_death': '"1893"^^<http://www.w3.org/2001/XMLSchema#gYear>',
    'f_people.person.date_of_birth': '"1814"^^<http://www.w3.org/2001/XMLSchema#gYear>',
    'f_people.person.gender': 'f_m.05zppz',
    'f_people.person.profession': 'f_m.036n1',
    'f_type.object.name': '"Henry Higgins"@en',
    'fk_key.wikipedia.en': [   '"Henry_Higgins_$0028botanist$0029"',
                               '"Henry_Hugh_Higgins"'],
    'fk_key.wikipedia.en_id': '"32997517"',
    'fk_key.wikipedia.en_title': '"Henry_Higgins_$0028botanist$0029"',
    'r_type': [   'f_common.topic',
                  'f_people.deceased_person',
                  'f_people.person'],
    'rs_label': '"Henry Higgins"@en',
    'subject': 'f_m.0h54qv8'}
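
The (document id, score) pairs returned by fbi.search make it easy to detect
ambiguous matches, such as the two top hits above that share the same score.
The following Python sketch assumes the list-of-tuples shape shown in the
search result above; best_hits is a hypothetical helper and not part of
fbtools:

```python
def best_hits(hits, tolerance=1e-6):
    """Return the doc ids of all hits whose score ties with the best score."""
    if not hits:
        return []
    best = max(score for _, score in hits)
    return [docid for docid, score in hits if best - score <= tolerance]


# Hits copied from the fbi.search result shown above.
hits = [(69442668, 8.640470504760742), (93746332, 8.640470504760742),
        (3290194, 4.759587287902832), (15812, 4.0124993324279785),
        (63493, 4.0124993324279785)]
print(best_hits(hits))  # prints [69442668, 93746332]
```

More than one id in the result means the query alone cannot tell the
candidates apart, and additional restrictions are needed to disambiguate.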

>>> pp.pprint(fbi.retrieve('Henry Higgins AND r_type:f_people.person', maxHits=5))
[   {   '_docid': 69442668,
        '_score': 8.640470504760742,
        'f_common.topic.article': 'f_m.0h54qvd',
        'f_common.topic.description': '"Henry Hugh Higgins was an English botanist, bryologist, geologist, curator and clergyman. He is cited as an authority in scientific classification, as Higgins.\\nHe was inspector of the National Schools in Liverpool from 1842 to 1848 and chaplain to the Rainhill Asylum, also in Liverpool. He was also president of the Liverpool Field Naturalists\' Club from 1861 to 1881.\\nHe especially worked on the Ravenhead collections, almost wholly made up of Upper Carboniferous flora, fish, bivalves and insect remains. Higgins had suggested that Ravenhead donate his collections to the Liverpool Museum and the donation gained a home with the construction of the railway in 1870, which exposed two Carboniferous seams known as the Upper and Lower Ravenhead. Most of Liverpool Museum\'s collections survived the Liverpool Blitz of May 1941 which practically destroyed the Museum itself, but the entire Ravenhead collection was lost in the fire."@en',
        'f_common.topic.notable_for': 'f_g.125crzjzl',
        'f_common.topic.notable_types': 'f_m.022tfrk',
        'f_common.topic.topic_equivalent_webpage': [   'we_Henry_Higgins_(botanist)',
                                                       'we_index.html?curid=32997517'],
        'f_people.deceased_person.date_of_death': '"1893"^^<http://www.w3.org/2001/XMLSchema#gYear>',
        'f_people.person.date_of_birth': '"1814"^^<http://www.w3.org/2001/XMLSchema#gYear>',
        'f_people.person.gender': 'f_m.05zppz',
        'f_people.person.profession': 'f_m.036n1',
        'f_type.object.name': '"Henry Higgins"@en',
        'fk_key.wikipedia.en': [   '"Henry_Higgins_$0028botanist$0029"',
                                   '"Henry_Hugh_Higgins"'],
        'fk_key.wikipedia.en_id': '"32997517"',
        'fk_key.wikipedia.en_title': '"Henry_Higgins_$0028botanist$0029"',
        'r_type': [   'f_common.topic',
                      'f_people.deceased_person',
                      'f_people.person'],
        'rs_label': '"Henry Higgins"@en',
        'subject': 'f_m.0h54qv8'},
    {   '_docid': 93746332,
        '_score': 8.640470504760742,
        'f_common.topic.article': 'f_m.0ll1ywd',
        'f_common.topic.description': '"Henry Higgins was an English bullfighter, who was born in Bogot\\u00E1, Colombia in 1944. He died as a result of a hang-gliding accident, while demonstrating it by jumping off a 200 ft high hill in 1978. He was educated at King Williams College in the Isle of Man."@en',
        'f_common.topic.notable_for': 'f_g.12q4p343t',
        'f_common.topic.notable_types': 'f_m.04kr',
        'f_common.topic.topic_equivalent_webpage': [   'we_Henry_Higgins_(bullfighter)',
                                                       'we_index.html?curid=36799328'],
        'f_people.person.profession': 'f_m.01kr58',
        'f_type.object.name': '"Henry Higgins"@en',
        'fk_key.wikipedia.en': '"Henry_Higgins_$0028bullfighter$0029"',
        'fk_key.wikipedia.en_id': '"36799328"',
        'fk_key.wikipedia.en_title': '"Henry_Higgins_$0028bullfighter$0029"',
        'r_type': ['f_common.topic', 'f_people.person'],
        'rs_label': '"Henry Higgins"@en',
        'subject': 'f_m.0ll1yw8'},
    {   '_docid': 3290194,
        '_score': 4.759587287902832,
        'f_common.topic.article': 'f_m.04j19m',
        'f_common.topic.description': '"Terence Langley Higgins, Baron Higgins KBE DL PC is a retired British Conservative politician and Commonwealth Games silver medalist winner for England.\\nHiggins was Member of Parliament for Worthing from 1964 to 1997, and Financial Secretary to the Treasury between 1972 and 1974.\\nHe served in the RAF from 1946 to 1948, and was a member of British Olympic Team in 1948 and 1952. He was created a life peer as Baron Higgins, of Worthing in the County of West Sussex on 28 October 1997. While in opposition, he served as the Conservative shadow minister for work and pensions in the House of Lords. He was appointed a Knight Commander of the Order of the British Empire in the 1993 New Years Honours List."@en',
        'f_common.topic.notable_for': 'f_g.1257q_vsp',
        'f_common.topic.notable_types': 'f_m.02xlh55',
        'f_common.topic.topic_equivalent_webpage': [   'we_Terence_Higgins,_Baron_Higgins',
                                                       'we_index.html?curid=1215819'],
        'f_government.politician.government_positions_held': 'f_m.04ntzv0',
        'f_government.politician.party': 'f_m.04htc0_',
        'f_people.person.date_of_birth': '"1928-01-18"^^<http://www.w3.org/2001/XMLSchema#date>',
        'f_people.person.gender': 'f_m.05zppz',
        'f_people.person.nationality': 'f_m.07ssc',
        'f_type.object.name': '"Terence Higgins, Baron Higgins"@en',
        'fk_key.en': '"terence_higgins_baron_higgins"',
        'fk_key.wikipedia.en': [   '"Baron_Higgins"',
                                   '"Lord_Higgins"',
                                   '"Terence_Higgins$002C_Baron_Higgins"',
                                   '"Terence_Langley_Higgins"'],
        'fk_key.wikipedia.en_id': '"1215819"',
        'fk_key.wikipedia.en_title': '"Terence_Higgins$002C_Baron_Higgins"',
        'r_type': [   'f_common.topic',
                      'f_government.politician',
                      'f_people.person',
                      'f_royalty.chivalric_order_member',
                      'f_royalty.noble_person'],
        'rs_label': '"Terence Higgins, Baron Higgins"@en',
        'subject': 'f_m.04j19g'},
    {   '_docid': 15812,
        '_score': 4.0124993324279785,
        'f_common.topic.article': 'f_m.03chhdq',
        'f_common.topic.description': '"Debra Elaine Higgins is a Canadian provincial politician, who was the Saskatchewan New Democratic Party member of the Legislative Assembly of Saskatchewan for the constituency of Moose Jaw Wakamow from 1999 to 2011. She is currently the mayor of Moose Jaw, Saskatchewan, having been elected as the city\'s first female mayor in the Saskatchewan municipal elections, 2012.\\nShe was first elected in the 1999 election and was re-elected in the 2003 and 2007 elections. Higgins served in the cabinet of Lorne Calvert as the Minister of Labour and later as the Minister of Learning.\\nAfter the defeat of the NDP government in the 2007 election, Higgins has served as the NDP critic for municipal affairs, liquor and gaming, and women\'s issues.\\nOn January 30, 2009, she announced her bid to succeed Calvert as Saskatchewan NDP leader at the party\'s June 2009 leadership convention. Higgins ran on the theme of renewal and defeating Premier Brad Wall. In the end she finished last of four candidates with Dwain Lingenfelter being the victor.\\nIn the 2011 election Higgins was defeated in her riding by Greg Lawrence of the Saskatchewan Party.\\nHiggins got her start in politics when she became involved with the UFCW union in 1982 while working at a Safeway grocery store. She later served as the President of the UFCW Council from 1993 to 1999, during which period she also served as a table officer for the Moose Jaw & District Labour Council."@en',
        'f_common.topic.notable_for': 'f_g.12556xmf4',
        'f_common.topic.notable_types': 'f_m.04kr',
        'f_common.topic.topic_equivalent_webpage': [   'we_Deb_Higgins',
                                                       'we_index.html?curid=13762292'],
        'f_government.politician.party': 'f_m.0lr0_qy',
        'f_people.person.date_of_birth': '"1954"^^<http://www.w3.org/2001/XMLSchema#gYear>',
        'f_people.person.gender': 'f_m.02zsn',
        'f_people.person.places_lived': 'f_m.0wllybw',
        'f_type.object.name': '"Deb Higgins"@en',
        'fk_key.en': '"deb_higgins"',
        'fk_key.source.videosurf': '"125328"',
        'fk_key.wikipedia.en': '"Deb_Higgins"',
        'fk_key.wikipedia.en_id': '"13762292"',
        'fk_key.wikipedia.en_title': '"Deb_Higgins"',
        'r_type': ['f_common.topic', 'f_people.person'],
        'rs_label': '"Deb Higgins"@en',
        'subject': 'f_m.03chhdl'},
    {   '_docid': 63493,
        '_score': 4.0124993324279785,
        'f_common.topic.article': 'f_m.03d0fnj',
        'f_common.topic.description': '"Terence John Higgins is Chief Justice of the Australian Capital Territory, a territory of Australia."@en',
        'f_common.topic.notable_for': 'f_g.125dtp53k',
        'f_common.topic.notable_types': 'f_m.04kr',
        'f_common.topic.topic_equivalent_webpage': [   'we_Terence_Higgins_(judge)',
                                                       'we_index.html?curid=14320511'],
        'f_people.person.date_of_birth': '"1943"^^<http://www.w3.org/2001/XMLSchema#gYear>',
        'f_people.person.education': 'f_m.0sw2b_6',
        'f_people.person.gender': 'f_m.05zppz',
        'f_people.person.place_of_birth': 'f_m.0chghy',
        'f_type.object.name': '"Terence Higgins"@en',
        'fk_key.en': '"terence_john_higgins"',
        'fk_key.wikipedia.en': [   '"Terence_Higgins_$0028judge$0029"',
                                   '"Terence_John_Higgins"',
                                   '"Terrence_John_Higgins"'],
        'fk_key.wikipedia.en_id': '"14320511"',
        'fk_key.wikipedia.en_title': '"Terence_Higgins_$0028judge$0029"',
        'r_type': ['f_common.topic', 'f_people.person'],
        'rs_label': '"Terence Higgins"@en',
        'subject': 'f_m.03d0fnd'}]
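
As the output above shows, literal values keep their RDF decoration (quotes
plus an optional @lang tag or ^^<datatype> suffix), and a field holds a bare
string when it has one value but a list when it has several (compare
'fk_key.wikipedia.en' in the last two results).  The following is a minimal
post-processing sketch under those assumptions only; parse_literal and as_list
are hypothetical helper names and not part of FreebaseTools.

```python
import re

# Hypothetical helpers, not part of FreebaseTools.  The pattern covers only the
# literal shapes visible in the sample output: "text", "text"@lang and
# "text"^^<datatype>.
_LITERAL = re.compile(
    r'"(?P<text>.*)"(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dtype>[^>]+)>)?',
    re.DOTALL)

def parse_literal(value):
    """Return (text, lang, datatype); non-literals such as 'f_m.04kr' pass through."""
    m = _LITERAL.fullmatch(value)
    if m is None:
        return value, None, None
    return m.group('text'), m.group('lang'), m.group('dtype')

def as_list(value):
    """Normalize a field that is a bare string when single-valued, else a list."""
    return list(value) if isinstance(value, (list, tuple)) else [value]

# A few fields copied from the "Deb Higgins" result above.
result = {
    'subject': 'f_m.03chhdl',
    'rs_label': '"Deb Higgins"@en',
    'f_people.person.date_of_birth':
        '"1954"^^<http://www.w3.org/2001/XMLSchema#gYear>',
    'r_type': ['f_common.topic', 'f_people.person'],
    'fk_key.wikipedia.en': '"Deb_Higgins"',
}

label, lang, _ = parse_literal(result['rs_label'])
print(label, lang)        # Deb Higgins en

year, _, dtype = parse_literal(result['f_people.person.date_of_birth'])
print(year, dtype)        # 1954 http://www.w3.org/2001/XMLSchema#gYear

keys = [parse_literal(v)[0] for v in as_list(result['fk_key.wikipedia.en'])]
print(keys)               # ['Deb_Higgins']
```

The sketch leaves escape sequences inside the text (such as the \n in the
long description above) untouched; whether and how to decode them depends on
how the values are used downstream.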