How did you explore the files? - Pandora papers 2021

Oct 6, 2021 - 16:31

Only 4% of the files were structured, with data organized in tables (spreadsheets, CSV files and a few “dbf files”).

To explore and analyze the information in the Pandora Papers, ICIJ identified files that contained beneficial ownership information by company and jurisdiction and structured it accordingly. Each provider’s data required a different process.

In cases where the information came in spreadsheet form, ICIJ removed duplicates and combined it into a master spreadsheet. For PDF or document files, ICIJ used programming languages such as Python to automate data extraction and structuring as much as possible.
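
As an illustration of the spreadsheet step, here is a minimal Python sketch of merging several provider tables into one master list while dropping duplicates. It is not ICIJ's actual code: the column names (`company`, `owner`, `jurisdiction`), the sample rows and the rule that two rows are duplicates when those three fields match after normalization are all assumptions made for the example.

```python
import csv
import io


def normalize(value):
    """Collapse whitespace and case so near-identical cells compare equal."""
    return " ".join(value.split()).casefold()


def merge_rows(sources, key_fields):
    """Combine rows from several tables, keeping the first copy of each key."""
    seen = set()
    master = []
    for provider, rows in sources.items():
        for row in rows:
            key = tuple(normalize(row.get(field, "")) for field in key_fields)
            if key in seen:
                continue  # duplicate of a row that was already kept
            seen.add(key)
            # Record which provider the kept row came from.
            master.append({**row, "provider": provider})
    return master


# Two tiny invented CSV exports standing in for provider spreadsheets.
csv_a = "company,owner,jurisdiction\nAcme Ltd,Jane Doe,BVI\nAcme Ltd,Jane Doe,BVI\n"
csv_b = "company,owner,jurisdiction\nacme ltd,JANE  DOE,bvi\nBeta Inc,John Roe,Panama\n"

sources = {
    "provider_a": list(csv.DictReader(io.StringIO(csv_a))),
    "provider_b": list(csv.DictReader(io.StringIO(csv_b))),
}
master = merge_rows(sources, ["company", "owner", "jurisdiction"])
```

In practice the choice of key fields decides what counts as a duplicate, so a real pipeline would likely need a different key for each provider.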

In more complex cases, ICIJ used machine learning and other tools, including the Fonduer and Scikit-learn software, to identify and separate specific forms from longer documents.

Some provider forms were handwritten, requiring ICIJ to extract information manually.

Once the information was extracted and structured, ICIJ generated lists that linked beneficial owners to the companies they owned in specific jurisdictions. In some cases, information about where or when a company was registered wasn’t available. In others, information was missing about when a person or an entity had become the owner of the company, among other details.
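
One way to picture such a list is as a record type in which the details that were sometimes unavailable are optional. The Python sketch below is only an assumption about how this could be modeled; the class name `OwnershipLink`, its field names and the sample values are hypothetical and do not come from ICIJ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OwnershipLink:
    """One beneficial owner linked to one company (hypothetical schema)."""
    owner: str
    company: str
    jurisdiction: Optional[str] = None        # where the company was registered
    incorporation_date: Optional[str] = None  # when it was registered
    owner_since: Optional[str] = None         # when the owner acquired it

    def missing_fields(self):
        """Return the names of optional details that were not available."""
        optional = ("jurisdiction", "incorporation_date", "owner_since")
        return [name for name in optional if getattr(self, name) is None]


# Invented sample records: one complete, one with gaps.
links = [
    OwnershipLink("Jane Doe", "Acme Ltd", "BVI", "2009-03-14", "2010-01-05"),
    OwnershipLink("John Roe", "Beta Inc", jurisdiction="Panama"),
]
incomplete = [link for link in links if link.missing_fields()]
```

Keeping gaps explicit as `None`, rather than as empty strings or guessed values, makes it easy to report which records still lack a date or a jurisdiction.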

After structuring the data, ICIJ used graph platforms (Neo4j and Linkurious) to generate visualizations and make them searchable. This allowed reporters to explore connections between people and companies across providers.
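
To show what exploring connections means in practice, the sketch below builds a small in-memory graph and finds the shortest chain linking a person to a company. It uses plain Python and a breadth-first search as a stand-in; it does not use the Neo4j or Linkurious APIs, and the names and links are invented.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Build an undirected graph from (owner, company) pairs."""
    graph = defaultdict(set)
    for owner, company in links:
        graph[owner].add(company)
        graph[company].add(owner)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the shortest chain of nodes or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):  # sorted for a stable result
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Invented links; people and companies share one namespace in this sketch.
links = [
    ("Jane Doe", "Acme Ltd"),
    ("John Roe", "Acme Ltd"),
    ("John Roe", "Beta Inc"),
]
graph = build_graph(links)
path = shortest_path(graph, "Jane Doe", "Beta Inc")
```

Here the path runs from Jane Doe through Acme Ltd and John Roe to Beta Inc, the kind of indirect connection that is hard to spot in a flat list.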

To identify potential story subjects in the data, ICIJ matched information in the leak against other data sets: sanctions lists, previous leaks, public corporate records, media lists of billionaires and public lists of political leaders.
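
The matching step can be sketched as normalizing names and comparing them with a similarity score. The example below uses Python's standard `difflib` and `unicodedata` modules; the normalization rules, the 0.9 threshold and the sample names are assumptions for illustration, not a description of ICIJ's matching method.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize_name(name):
    """Strip accents, case and punctuation, and ignore word order."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = no_accents.casefold().replace(",", " ").replace(".", " ").split()
    return " ".join(sorted(tokens))


def match_names(leak_names, watchlist, threshold=0.9):
    """Return (leak name, watchlist entry, score) for pairs above threshold."""
    normalized_watchlist = [(entry, normalize_name(entry)) for entry in watchlist]
    matches = []
    for name in leak_names:
        key = normalize_name(name)
        for entry, entry_key in normalized_watchlist:
            score = SequenceMatcher(None, key, entry_key).ratio()
            if score >= threshold:
                matches.append((name, entry, round(score, 2)))
    return matches


# Invented names; "Jos\u00e9 P\u00e9rez" is written with escapes for the accents.
leak_names = ["DOE, Jane", "Jos\u00e9 P\u00e9rez", "John Roe"]
watchlist = ["Jane Doe", "Jose Perez", "Richard Miles"]
matches = match_names(leak_names, watchlist)
```

A match produced this way is only a candidate: different people can share a name, so each hit still has to be confirmed against the underlying documents.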

ICIJ’s partner in Sweden, SVT, generated spreadsheets containing data extracted from passports found in the Pandora Papers.

ICIJ shared records with media partners using Datashare, a secure research and analysis tool developed by ICIJ’s technical team. Datashare’s batch-search function helped reporters match some public figures with the data.

The leak contains routine documents that service providers gather for due diligence – news articles, Wikipedia entries, information from financial data provider World-Check – that don’t necessarily confirm whether a person is hiding wealth in a secrecy jurisdiction. ICIJ used machine learning to tag such files in Datashare, enabling reporters to exclude them from their searches.
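
ICIJ has not said here which model it used for this tagging. As a rough illustration of the idea, the sketch below trains a tiny multinomial naive Bayes text classifier in plain Python and uses its tags to leave due-diligence files out of a search. The labels, training snippets and file names are invented, and nothing in it uses the Datashare API.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.casefold())
    return cleaned.split()


class NaiveBayesTagger:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training snippets with hypothetical labels.
train_texts = [
    "World-Check screening report: no adverse media found",
    "Wikipedia entry and news article about the client",
    "Register of directors and shareholders of the company",
    "Certificate of incorporation and share transfer form",
]
train_labels = ["due_diligence", "due_diligence", "corporate", "corporate"]
tagger = NaiveBayesTagger().fit(train_texts, train_labels)

# Tag some invented files, then keep only those that are not due diligence.
files = {
    "a.pdf": "news article and screening report on the client",
    "b.pdf": "share transfer form of the company",
}
tags = {name: tagger.predict(text) for name, text in files.items()}
searchable = [name for name, tag in tags.items() if tag != "due_diligence"]
```

A real classifier would need far more training data and a check of its error rate, since a wrongly tagged file would silently drop out of reporters' searches.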

Our 150 media partners shared tips, leads and other information of interest using ICIJ’s global I-Hub, a secure social media and messaging platform. Throughout the project, ICIJ held extensive training sessions for partners on the use of ICIJ technology to explore, mine and better understand the files.