Aleph is a tool for indexing large amounts of both documents (PDF, Word, HTML) and structured data (CSV, XLS, SQL) for easy browsing and search. It is built with investigative reporting as a primary use case. Aleph allows cross-referencing mentions of well-known entities (such as people and companies) against watchlists, e.g. from prior research or public datasets.
Here are some key features:
- Web-based search across large document and data sets.
- Imports many file formats, including popular office formats, spreadsheets, email and zipped archives. Processing includes optical character recognition, language and encoding detection and named entity extraction.
- Load structured entity graph data from databases and CSV files. This allows navigation of complex datasets like company registries, sanctions lists or procurement data. Import tools for OpenSanctions are included.
- Receive notifications for new search matches with a personal watchlist.
- OAuth authorization and access control on a per-source and per-watchlist basis.
The goal of aleph 3.0.0 is to harmonise the handling of data inside the index. Instead of having different formats and mappings for documents, entities, table rows and document pages, there is now just one type of index object: an entity.
This means that document-based data is now completely ‘translated’ to the followthemoney ontology used by aleph (meaning that in theory, each page of a document and each row of a table is now a node in the object graph).
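To make the single-entity model more concrete, here is a minimal sketch of how a table and its rows could be represented as uniform entities. The `Entity` class, the `Table` and `Row` schema names and the property keys are illustrative assumptions, not the actual followthemoney API or aleph's index mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One index object; the schema says what kind of thing it is."""
    id: str
    schema: str  # hypothetical schema name, e.g. "Table" or "Row"
    properties: dict = field(default_factory=dict)


def table_to_entities(table_id, header, rows):
    """Represent a table and each of its rows as entities of one common type."""
    # The table itself is an entity.
    entities = [Entity(table_id, "Table", {"columns": list(header)})]
    for index, cells in enumerate(rows, start=1):
        # Each row is its own entity and points back at its table,
        # which makes it a node in the same graph.
        entities.append(
            Entity(
                id=f"{table_id}.row{index}",
                schema="Row",
                properties={"table": table_id, "index": index, "cells": list(cells)},
            )
        )
    return entities
```

Calling `table_to_entities("t1", ["name", "country"], [["Acme", "DE"]])` would yield one `Table` entity and one `Row` entity that share the same shape, which is the point of having a single type of index object.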
Setting OCR_VISION_API enables use of the Google Vision API for optical character recognition.
The /api/2/collections/<id>/ingest API now only accepts a single file, or no file (which will create a folder). The response body contains only the ID of the generated document. The status code on success is now 201, not 200.
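A client written against the changed ingest endpoint might look roughly like the sketch below. It uses only the Python standard library. The endpoint path and the 201 status come from the text above; the POST method, the `file` form field name, the empty body for the folder case and the JSON `{"id": ...}` response shape are assumptions, as are the helper names.

```python
import json
import urllib.request
import uuid


def ingest_url(base_url, collection_id):
    # base_url is a placeholder for your own aleph instance.
    return f"{base_url.rstrip('/')}/api/2/collections/{collection_id}/ingest"


def build_request(base_url, collection_id, file_name=None, content=b""):
    """Prepare a POST carrying one file, or no file (which creates a folder)."""
    url = ingest_url(base_url, collection_id)
    if file_name is None:
        # Assumption: an empty POST body is how "no file" is expressed.
        return urllib.request.Request(url, data=b"", method="POST")
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        # Assumption: the upload field is called "file".
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    body = head + content + f"\r\n--{boundary}--\r\n".encode("utf-8")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def document_id(status, raw_body):
    """Return the new document's ID, insisting on 201 rather than 200."""
    if status != 201:
        raise RuntimeError(f"unexpected ingest status: HTTP {status}")
    # Assumption: the body is JSON shaped like {"id": ...}.
    return json.loads(raw_body)["id"]
```

To send the request, pass it to `urllib.request.urlopen` and hand `response.status` and `response.read()` to `document_id`. Authentication is left out here because it depends on how the instance is configured.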
Copyright (c) 2014-2015 Friedrich Lindenberg
Copyright (c) 2016-2017 Journalism Development Network, Inc.