FreebaseTools is a small, lightweight toolkit to pre-process, filter, index
and store the Freebase knowledge base in a fast and relatively "small"
Lucene index. KB variants such as BaseKB Gold, which is used as the
reference KB for NIST's TAC-KBP entity discovery and linking (EDL)
evaluations, can also be handled.
- significant size reduction: reduces the official Google RDF dump from
about 3B triples to around 880M (or from 400GB uncompressed to about 60GB)
- relatively small deployment size: the English-only view of TAC-KBP's BaseKB
Gold can be stored in a 12.5 GB index directory which can be efficiently
searched by a standard desktop machine or laptop without huge amounts of memory
- powerful ranked Lucene free-text and fuzzy search on Freebase's
description, name, label, etc. text fields, combined with hard restrictions
on fields such as types; search individual fields or across all indexed
text fields of a subject (see the query sketch after this list)
- multi-lingual search that uses configurable language-appropriate
tokenizers; search per language or combine search strings from multiple
languages
- simple command-line API to explore the data and extract relevant views
(e.g., all subjects of type person)
- simple Python interface for easy integration and interactive use
- this is not a replacement for a triple store: while many interesting and
useful queries can be expressed and executed via Lucene, it is not a database
system, and some more complex SPARQL queries can only be emulated via code
(which might not be very efficient). There is also no inference.
- the toolkit is an early prototype and does not yet have a fully mature set
of features and APIs
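To make the search features above concrete, here is a minimal sketch of how
a ranked fuzzy text search can be combined with a hard type restriction
directly in Lucene. This is not the toolkit's own API: the index directory
name, the field names (rs_label, r_type, subject) and the type value are
assumptions about the index layout, and the code uses Lucene 4.x query
classes to match the Java 1.7 requirement stated below.

  // A hedged sketch, not FreebaseTools' own API: a ranked fuzzy match on a
  // label field combined with a hard restriction on a type field.  The
  // field and value names used here are assumptions about the index layout.
  import java.io.File;

  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.FuzzyQuery;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;

  public class QuerySketch {
      public static void main(String[] args) throws Exception {
          DirectoryReader reader =
              DirectoryReader.open(FSDirectory.open(new File("freebase.index")));
          IndexSearcher searcher = new IndexSearcher(reader);

          // MUST + MUST: the fuzzy clause contributes the relevance ranking,
          // while the term clause acts as a hard filter on the type field.
          BooleanQuery query = new BooleanQuery();
          query.add(new FuzzyQuery(new Term("rs_label", "einstien")),
                    BooleanClause.Occur.MUST);
          query.add(new TermQuery(new Term("r_type", "f_people.person")),
                    BooleanClause.Occur.MUST);

          TopDocs hits = searcher.search(query, 10);
          for (ScoreDoc hit : hits.scoreDocs) {
              System.out.println(searcher.doc(hit.doc).get("subject")
                                 + "\t" + hit.score);
          }
          reader.close();
      }
  }

Per-language search would work the same way, just against per-language
fields analyzed with the appropriate language-specific tokenizers.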
How It Works
Index generation proceeds in two phases:
- Pre-processing and shrinking: this phase abbreviates and normalizes URIs
so that they can be easily referenced in Lucene queries, and ignores unwanted
triples (customized by a number of "ignore-*" files). In particular:
- common namespace prefixes such as http://rdf.freebase.com/ns/ and
http://rdf.basekb.com/ns/ become short prefixes such as "f_" or "rs_", etc.
- redundant triples (e.g., <f_type.object.type> and <f_type.type.instance>)
are eliminated (those are already eliminated from BaseKB)
- triples referencing ignored language strings are eliminated
- triples referencing ignored language Wikipedia pages are eliminated
- a number of other predicates that are useless for most tasks, such as
ISBN numbers, MusicBrainz track listings, various keys, etc., are eliminated
- The normalized, shrunk and sorted triples file then gets indexed via
Lucene. Each subject and all of its predicates become a Lucene document.
Each predicate and its values become stored fields in the document. A small
(customizable) set of those fields is also indexed to allow efficient
querying for names and variants, descriptions, type fields, etc. Lucene
efficiently compresses all field data, which results in a relatively small
index size. Sketches of both phases follow below.
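To illustrate the first phase, here is a minimal sketch of the URI
abbreviation step. This is not the toolkit's actual code, and the prefix
table (in particular the rs_ mapping) is an assumption based on the
examples above; in the toolkit the normalization is configurable.

  // A hedged sketch of URI abbreviation, not the toolkit's actual code.
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class AbbreviateUris {
      // The prefix table shown here is an assumption:
      private static final Map<String, String> PREFIXES =
          new LinkedHashMap<String, String>();
      static {
          PREFIXES.put("http://rdf.freebase.com/ns/", "f_");
          // Assumed mapping for the rs_ prefix:
          PREFIXES.put("http://www.w3.org/2000/01/rdf-schema#", "rs_");
      }

      // <http://rdf.freebase.com/ns/type.object.type> -> f_type.object.type
      static String abbreviate(String uriRef) {
          if (uriRef.startsWith("<") && uriRef.endsWith(">")) {
              String uri = uriRef.substring(1, uriRef.length() - 1);
              for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
                  if (uri.startsWith(e.getKey())) {
                      return e.getValue() + uri.substring(e.getKey().length());
                  }
              }
          }
          return uriRef;  // leave literals and unknown namespaces untouched
      }

      public static void main(String[] args) {
          // prints: f_type.object.type
          System.out.println(
              abbreviate("<http://rdf.freebase.com/ns/type.object.type>"));
      }
  }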
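And a similarly hedged sketch of the second phase in plain Lucene 4.x, not
the toolkit's own code: one document per subject, one stored field per
predicate, and a few fields additionally indexed for search. The field
names and the stored-versus-indexed split shown here are assumptions.

  // Hedged sketch of the indexing scheme, not the toolkit's actual code.
  import java.io.File;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StoredField;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class IndexSketch {
      public static void main(String[] args) throws Exception {
          // Pick the Version constant matching the bundled Lucene jars.
          IndexWriterConfig config = new IndexWriterConfig(
              Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
          IndexWriter writer = new IndexWriter(
              FSDirectory.open(new File("freebase.index")), config);

          // One Lucene document per subject (field names are assumptions):
          Document doc = new Document();
          // The subject key, indexed as a single token for exact lookup:
          doc.add(new StringField("subject", "f_m.0jcx", Field.Store.YES));
          // A predicate that is stored only, not searchable:
          doc.add(new StoredField("f_people.person.date_of_birth",
                                  "1879-03-14"));
          // A predicate that is stored *and* indexed for free-text search:
          doc.add(new TextField("rs_label", "Albert Einstein",
                                Field.Store.YES));
          writer.addDocument(doc);
          writer.close();
      }
  }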
As of version 1.2.0, a couple of pre-built indexes are available for download
so you can get started right away without having to build your own. See the
README file for instructions on how to download and install them.
Requirements
Java 1.7 is required, as well as a number of supporting libraries which are
already provided in the lib directory.
Different versions might work as well, but your mileage may vary. All
code and scripts have been developed and tested under Linux
only (specifically openSUSE). They should generally work on macOS as
well but will definitely require adaptation for Windows.
If you want to use the Python interface, you will need Python 2.7 and
a compatible version of the jnius package.
Installation and Use
See the README file.
Questions, Suggestions and Comments
Please send any questions or comments to Hans Chalupsky (hans AT
isi . edu).