  • ■ ■ ■ ■ ■ ■
    .streamlit/config.toml
     1 +[theme]
     2 +base="light"
     3 +primaryColor="#F63333"
  • ■ ■ ■ ■ ■
    README.md
    1  -# Project name
     1 +# 451 Corporate Risk Miner
    2 2   
    3 3  ## Team Members
    4  -This section is a list of team members, and possibly links to GitHub/GitLab/LinkedIn/personal blog pages for members.
     4 +Elena Dulskyte [linkedin](https://www.linkedin.com/in/elena-dulskyte-50b83aa2/)
     5 + 
     6 +Marko Sahan [github](http://github.com/sahanmar) [linkedin](https://www.linkedin.com/in/msahan/)
     7 + 
     8 +Peter Zatka-Haas [github](http://github.com/peterzh) [linkedin](https://www.linkedin.com/in/peterzatkahaas)
    5 9   
    6 10  ## Tool Description
    7  -This sections discusses the purpose and motivation for the tool, and how it addresses a tool need you've identified.
     11 + 
     12 +Financial crime journalists need to dig through complex corporate ownership databases (i.e. databases of companies and the people/companies that control those companies) in order to find associations with criminal activity. They face several problems along the way:
     13 +1. It is difficult to search across multiple publicly-available databases (UK Companies House, sanctions lists, ICIJ Leaks, VK)
     14 +2. There are multiple ‘risk signatures’ associated with criminal activity (e.g. cyclic or long-chain ownership, links to sanctions, etc.) and different journalists prioritise different kinds of signatures in their investigation
     15 +3. The number of corporate networks is overwhelming, and so it is hard to prioritise which corporate ownership structures are more ‘risky’ than others
     16 + 
     17 +451 Corporate Risk Miner allows a user to navigate over different corporate ownership networks extracted from UK Companies House (UKCH) to identify and visualise those exhibiting risk signatures associated with financial crime. Example risk signatures include:
     18 +* Cyclic ownership: Circular company ownership (e.g. Company A owns Company B which owns Company C which owns Company A)
     19 +* Long-chain ownership: Long chains of corporate ownership (e.g. Person A controls Company A. Company A is an officer of Company B. Company B is an officer of Company C, etc.)
     20 +* Links to tax havens: Corporate networks which involve companies/people associated with tax haven or secrecy jurisdictions
     21 +* Presence of proxy directors: Proxy directors are individual people who are registered as company directors on paper but who are likely never involved in the running of the business.
     22 +* Links to sanctioned entities: Officially sanctioned people or companies, from sources such as the UN Sanctions List.
     23 +* Links to politically-exposed persons (PEPs)
     24 +* Links to disqualified directors
     25 + 
     26 +The user can customise the relative importance of each risk signature for their search. The app then computes a **total risk score** for each corporate network in UKCH, and outlines the details of the highest-risk networks. The user can export these network results as a .csv file for later viewing.
    8 27   
    9 28  ## Installation
    10  -This section includes detailed instructions for installing the tool, including any terminal commands that need to be executed and dependencies that need to be installed. Instructions should be understandable by non-technical users (e.g. someone who knows how to open a terminal and run commands, but isn't necessarily a programmer), for example:
    11 29   
    12 30  1. Make sure you have Python version 3.8 or greater installed
    13 31   
    14 32  2. Download the tool's repository using the command:
    15  - 
    16  - git clone https://github.com/bellingcat/hackathon-submission-template.git
     33 +```
     34 +git clone https://github.com/sahanmar/451
     35 +```
    17 36   
    18 37  3. Move to the tool's directory and install the tool
     38 +```
     39 +cd 451
     40 +pip install -r requirements.txt
     41 +```
    19 42   
    20  - cd hackathon-submission-template
    21  - pip install .
     43 +4. Start the streamlit app
     44 +```
     45 +streamlit run app/app.py
     46 +```
     47 + 
     48 +5. In your web browser, load [http://localhost:8501](http://localhost:8501)
    22 49   
    23 50  ## Usage
    24  -This sections includes detailed instructions for using the tool. If the tool has a command-line interface, include common commands and arguments, and some examples of commands and a description of the expected output. If the tool has a graphical user interface or a browser interface, include screenshots and describe a common workflow.
     51 + 
     52 +TBD
    25 53   
    26 54  ## Additional Information
    27 55  This section includes any additional information that you want to mention about the tool, including:
    skipped 1 lines
    29 57  - Any limitations of the current implementation of the tool
    30 58  - Motivation for design/architecture decisions
    31 59   
     60 +### Limitations
     61 +* Limited to cliques of ??? hop distance owing to space limitation
     62 +* Cyclicity calculation assumes an undirected graph to save computational time. This could be improved by taking into account specific directions of ownership.
     63 +* Entity resolution for company/people entities could be improved
     64 +* Graph visualisation for large corporate networks can be too cluttered to be useful.
     65 + 
     66 +### Potential next steps
     67 +* Expand to corporate ownership databases outside of the UK, for example using OpenCorporates data.
     68 +* Incorporate more external data sources identifying criminal or potentially-criminal activity for companies and people.
     69 +*
     70 + 
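The README's limitation about cyclicity (ownership treated as an undirected graph) can be illustrated with a small standard-library sketch. This is not the project's cyclicity calculation, which is not shown in this diff; `has_directed_cycle` and the example edge lists are hypothetical names introduced for illustration. It uses a depth-first search with three node states to find cycles that respect the direction of ownership.

```python
from collections import defaultdict


def has_directed_cycle(edges):
    """Return True if the directed graph given as (owner, owned) pairs has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for owner, owned in edges:
        graph[owner].append(owned)
        nodes.update((owner, owned))

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    state = {node: WHITE for node in nodes}

    def visit(node):
        state[node] = GREY
        for neighbour in graph[node]:
            if state[neighbour] == GREY:  # back edge: the path returns to an ancestor
                return True
            if state[neighbour] == WHITE and visit(neighbour):
                return True
        state[node] = BLACK
        return False

    return any(state[node] == WHITE and visit(node) for node in nodes)


# Company A owns B, B owns C, C owns A: a genuine ownership cycle.
circular = [("A", "B"), ("B", "C"), ("C", "A")]

# A owns B and C, and B also owns C: a triangle if direction is ignored,
# but no company ends up owning itself.
triangle_no_cycle = [("A", "B"), ("A", "C"), ("B", "C")]

print(has_directed_cycle(circular))
print(has_directed_cycle(triangle_no_cycle))
```

An undirected check would report a cycle for both edge lists, which is the kind of false positive the limitation describes. The recursive search is fine for small networks; very deep ownership chains would need an iterative version.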
  • ■ ■ ■ ■ ■ ■
    app/app.py
     1 +import json
     2 +import pandas as pd
     3 +import streamlit as st
     4 +from streamlit_agraph import agraph, Config
     5 +from utils import (
     6 +    build_agraph_components,
     7 +    get_subgraph_nodes_df,
     8 +    get_subgraph_df,
     9 +    get_subgraph_edges_df,
     10 +    get_subgraph_with_risk_score,
     11 +    build_markdown_strings_for_node,
     12 +)
     13 + 
     14 + 
     15 +st.set_page_config(layout="wide")
     16 + 
     17 + 
     18 +SLIDER_MIN = 0
     19 +SLIDER_MAX = 100
     20 +SLIDER_DEFAULT = 50
     21 +DEFAULT_NUM_SUBGRAPHS_TO_SHOW = 3
     22 +GRAPH_PLOT_HEIGHT_PX = 400
     23 +GRAPH_SIZE_RENDER_LIMIT = 40
     24 +subgraphs = get_subgraph_df()
     25 + 
     26 +with st.sidebar:
     27 +    st.title("451 Corporate Risk Miner")
     28 + 
     29 +    weight_chains = (
     30 +        st.slider(
     31 +            "Long ownership chains",
     32 +            min_value=SLIDER_MIN,
     33 +            max_value=SLIDER_MAX,
     34 +            value=SLIDER_DEFAULT,
     35 +            disabled=True,
     36 +        )
     37 +        / SLIDER_MAX
     38 +    )
     39 +    weight_cyclic = (
     40 +        st.slider(
     41 +            "Cyclic ownership",
     42 +            min_value=SLIDER_MIN,
     43 +            max_value=SLIDER_MAX,
     44 +            value=SLIDER_DEFAULT,
     45 +        )
     46 +        / SLIDER_MAX
     47 +    )
     48 +    weight_psc_haven = (
     49 +        st.slider(
     50 +            "Persons of significant control associated with tax havens",
     51 +            min_value=SLIDER_MIN,
     52 +            max_value=SLIDER_MAX,
     53 +            value=SLIDER_DEFAULT,
     54 +            disabled=True,
     55 +        )
     56 +        / SLIDER_MAX
     57 +    )
     58 +    weight_pep = (
     59 +        st.slider(
     60 +            "Officers/PSCs are politically exposed",
     61 +            min_value=SLIDER_MIN,
     62 +            max_value=SLIDER_MAX,
     63 +            value=SLIDER_DEFAULT,
     64 +            disabled=True,
     65 +        )
     66 +        / SLIDER_MAX
     67 +    )
     68 +    weight_sanctions = (
     69 +        st.slider(
     70 +            "Officers/PSCs/Companies are sanctioned",
     71 +            min_value=SLIDER_MIN,
     72 +            max_value=SLIDER_MAX,
     73 +            value=SLIDER_DEFAULT,
     74 +            disabled=True,
     75 +        )
     76 +        / SLIDER_MAX
     77 +    )
     78 +    weight_disqualified = (
     79 +        st.slider(
     80 +            "Officers are disqualified directors",
     81 +            min_value=SLIDER_MIN,
     82 +            max_value=SLIDER_MAX,
     83 +            value=SLIDER_DEFAULT,
     84 +            disabled=True,
     85 +        )
     86 +        / SLIDER_MAX
     87 +    )
     88 + 
     89 +    custom_names = st.file_uploader(
     90 +        label="Custom persons/companies of interest", type="csv"
     91 +    )
     92 + 
     93 +    if custom_names:
     94 +        custom_names = pd.read_csv(custom_names, header=None)[0].tolist()
     95 +        st.write(custom_names)
     96 + 
     97 +    go = st.button("Go")
     98 + 
     99 + 
     100 +with st.container():
     101 + 
     102 +    subgraph_with_risk_scores = get_subgraph_with_risk_score(
     103 +        subgraphs,
     104 +        weight_chains=weight_chains,
     105 +        weight_cyclic=weight_cyclic,
     106 +        weight_psc_haven=weight_psc_haven,
     107 +        weight_pep=weight_pep,
     108 +        weight_sanctions=weight_sanctions,
     109 +        weight_disqualified=weight_disqualified,
     110 +    )
     111 + 
     112 +    st.dataframe(data=subgraph_with_risk_scores, use_container_width=True)
     113 + 
     114 +    selected_subgraph_hashes = st.multiselect(
     115 +        label="Select corporate network(s) to explore",
     116 +        options=list(subgraph_with_risk_scores.index),
     117 +        default=list(
     118 +            subgraph_with_risk_scores.head(DEFAULT_NUM_SUBGRAPHS_TO_SHOW).index
     119 +        ),
     120 +    )
     121 + 
     122 + 
     123 +with st.container():
     124 +    num_subgraphs_to_display = len(selected_subgraph_hashes)
     125 + 
     126 +    if num_subgraphs_to_display > 0:
     127 +        cols = st.columns(num_subgraphs_to_display)
     128 + 
     129 +        for c, subgraph_hash in enumerate(selected_subgraph_hashes):
     130 +            nodes_selected = get_subgraph_nodes_df(subgraph_hash)
     131 +            edges_selected = get_subgraph_edges_df(subgraph_hash)
     132 + 
     133 +            with cols[c]:
     134 +                if len(nodes_selected) < GRAPH_SIZE_RENDER_LIMIT:
     135 +                    (node_objects, edge_objects) = build_agraph_components(
     136 +                        nodes_selected, edges_selected
     137 +                    )
     138 +                    agraph(
     139 +                        nodes=node_objects,
     140 +                        edges=edge_objects,
     141 +                        config=Config(
     142 +                            width=round(1080 / num_subgraphs_to_display),
     143 +                            height=GRAPH_PLOT_HEIGHT_PX,
     144 +                            nodeHighlightBehavior=True,
     145 +                            highlightColor="#F7A7A6",
     146 +                            directed=True,
     147 +                            collapsible=True,
     148 +                        ),
     149 +                    )
     150 +                else:
     151 +                    st.error("Subgraph is too large to render")
     152 + 
     153 +                # Build markdown strings for representing metadata
     154 +                markdown_strings = build_markdown_strings_for_node(nodes_selected)
     155 + 
     156 +                st.markdown(":busts_in_silhouette: **People**")
     157 +                for person in markdown_strings["people"]:
     158 +                    st.markdown(person)
     159 + 
     160 +                st.markdown(":office: **Companies**")
     161 +                # Use a distinct name so the column index `c` is not shadowed
     162 +                for company in markdown_strings["companies"]:
     163 +                    st.markdown(company)
     163 + 
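The sidebar in `app/app.py` reads an uploaded CSV of custom persons/companies of interest but, in this diff, only echoes it back with `st.write`. One way the list might be used is to flag nodes whose name matches an uploaded entry. The sketch below is an assumption, not part of the app: `normalise_name` and `flag_custom_matches` are hypothetical helpers, the sample ids are made up, and it assumes (as the label code in `app/utils.py` suggests) that the name is the part of `node_id` before the first `|`.

```python
def normalise_name(name):
    """Case-fold a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(name).casefold().split())


def flag_custom_matches(node_ids, custom_names):
    """Return the node ids whose name part (before the first '|') is in custom_names."""
    wanted = {normalise_name(name) for name in custom_names}
    return [
        node_id
        for node_id in node_ids
        if normalise_name(node_id.split("|")[0]) in wanted
    ]


# Made-up node ids in the assumed "name|extra" format.
example_node_ids = ["JOHN  SMITH|1970", "ACME HOLDINGS LTD|01234567", "Jane Doe|1985"]
example_custom_names = ["john smith", "Acme Holdings Ltd"]

print(flag_custom_matches(example_node_ids, example_custom_names))
```

Exact matching after normalisation is deliberately conservative; the README lists entity resolution as an area for improvement, and fuzzy matching would be a separate design decision.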
  • ■ ■ ■ ■ ■ ■
    app/utils.py
     1 +import streamlit as st
     2 +from streamlit_agraph import Node, Edge
     3 +import json
     4 +import pandas as pd
     5 + 
     6 +NODE_COLOUR_NON_DODGY = "#72EF77"
     7 +NODE_COLOUR_DODGY = "#F63333"
     8 +NODE_IMAGE_PERSON = "http://i.ibb.co/LrY3tfw/747376.png" # https://www.flaticon.com/free-icon/user_747376
     9 +NODE_IMAGE_COMPANY = "http://i.ibb.co/fx6r1dZ/4812244.png" # https://www.flaticon.com/free-icon/company_4812244
     10 + 
     11 + 
     12 +@st.cache()
     13 +def get_subgraph_df():
     14 +    return pd.read_parquet("./data/network.parquet", engine="pyarrow").set_index(
     15 +        "network_id"
     16 +    )
     17 + 
     18 + 
     19 +@st.cache()
     20 +def get_subgraph_nodes_df(subgraph_hash):
     21 +    return pd.read_parquet(
     22 +        "./data/nodes.parquet",
     23 +        filters=[[("subgraph_hash", "=", subgraph_hash)]],
     24 +        engine="pyarrow",
     25 +    )
     26 + 
     27 + 
     28 +@st.cache()
     29 +def get_subgraph_edges_df(subgraph_hash):
     30 +    return pd.read_parquet(
     31 +        "./data/edges.parquet",
     32 +        filters=[[("subgraph_hash", "=", subgraph_hash)]],
     33 +        engine="pyarrow",
     34 +    )
     35 + 
     36 + 
     37 +def get_subgraph_with_risk_score(
     38 +    subgraph_table,
     39 +    weight_chains,
     40 +    weight_cyclic,
     41 +    weight_psc_haven,
     42 +    weight_pep,
     43 +    weight_sanctions,
     44 +    weight_disqualified,
     45 +):
     46 + 
     47 +    out = subgraph_table.copy()
     48 +    out["total_risk"] = out["cyclicity"] * weight_cyclic / out["cyclicity"].max()
     49 +    return out.sort_values(by="total_risk", ascending=False)
     50 + 
     51 + 
     52 +def build_agraph_components(
     53 +    nodes,
     54 +    edges,
     55 +):
     56 +    """Create agraph object from node and edge list"""
     57 + 
     58 +    node_objects = []
     59 +    edge_objects = []
     60 + 
     61 +    for _, row in nodes.iterrows():
     62 +        # node_metadata = json.loads(row["node_metadata"])
     63 +        node_objects.append(
     64 +            Node(
     65 +                id=row["node_id"],
     66 +                label="\n".join(row["node_id"].split("|")[0].split(" ")),
     67 +                size=20,
     68 +                # color=NODE_COLOUR_DODGY
     69 +                # if (row["pep"] > 0 or row["sanction"] > 0)
     70 +                # else NODE_COLOUR_NON_DODGY,
     71 +                image=NODE_IMAGE_PERSON,
     72 +                # if row["is_person"] == 1
     73 +                # else NODE_IMAGE_COMPANY,
     74 +                shape="circularImage",
     75 +            )
     76 +        )
     77 + 
     78 +    for _, row in edges.iterrows():
     79 +        edge_objects.append(
     80 +            Edge(
     81 +                source=row["source"],
     82 +                # label=row["type"][0],
     83 +                target=row["target"],
     84 +            )
     85 +        )
     86 + 
     87 +    return (node_objects, edge_objects)
     88 + 
     89 + 
     90 +def build_markdown_strings_for_node(nodes_selected):
     91 +    """Separate into People and Company strings"""
     92 + 
     93 +    markdown_strings = dict()
     94 +    markdown_strings["companies"] = []
     95 +    markdown_strings["people"] = []
     96 + 
     97 +    for _, row in nodes_selected.iterrows():
     98 +        node_metadata = {
     99 +            "name": row["node_id"],
     100 +            "is_proxy": row["proxy_dir"],
     101 +            "is_person": True,
     102 +        }
     103 + 
     104 +        # node_metadata = json.loads(row["node_metadata"])
     105 +        # node_sanctions = (
     106 +        #     "" if row["sanction"] == 0 else f"! SANCTIONED: {row['sanction_metadata']}"
     107 +        # )
     108 +        # node_pep = "" if row["pep"] == 0 else f"! PEP: {row['pep_metadata']}"
     109 + 
     110 +        node_sanctions = ""
     111 +        node_pep = ""
     112 + 
     113 +        if node_metadata["is_person"]:
     114 +            # node_title = f"{node_metadata['name']} [{node_metadata['nationality']}/{node_metadata['yob']}/{node_metadata['mob']}]"
     115 +            node_title = f"{node_metadata['name']}"
     116 +            key = "people"
     117 +        else:
     118 +            # node_title = f"{node_metadata['name']} [{row['jur']}/{node_metadata['reg']}/{node_metadata['address']}]"
     119 +            node_title = f"{node_metadata['name']}"
     120 +            key = "companies"
     121 + 
     122 +        markdown_strings[key].append(
     123 +            "\n".join(
     124 +                [x for x in ["```", node_title, node_pep, node_sanctions] if len(x) > 0]
     125 +            )
     126 +        )
     127 + 
     128 +    return markdown_strings
     129 + 
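`get_subgraph_with_risk_score` accepts six weights but currently uses only `weight_cyclic`, and it divides by the column maximum, which in pandas yields NaN scores when every network has zero cyclicity. The plain-Python sketch below shows one way a weighted total could combine several max-normalised signatures while guarding against that case. It is an illustration, not the project's scoring: only `cyclicity` appears as a column in this diff, so the `chain_length` key, the function name `total_risk_scores` and the sample values are assumptions.

```python
def total_risk_scores(networks, weights):
    """Compute a weighted sum of max-normalised risk signatures per network.

    networks: dict mapping network id -> dict of raw signature values
    weights:  dict mapping signature name -> weight in [0, 1]
    """
    # Largest raw value of each signature, used to scale values into [0, 1].
    maxima = {
        name: max((values.get(name, 0) for values in networks.values()), default=0)
        for name in weights
    }

    scores = {}
    for network_id, values in networks.items():
        total = 0.0
        for name, weight in weights.items():
            if maxima[name] > 0:  # skip signatures that are zero everywhere
                total += weight * values.get(name, 0) / maxima[name]
        scores[network_id] = total

    # Highest-risk networks first, mirroring the descending sort in the app.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample data; "chain_length" is an assumed signature name.
example_networks = {
    "net-1": {"cyclicity": 2, "chain_length": 1},
    "net-2": {"cyclicity": 0, "chain_length": 4},
    "net-3": {"cyclicity": 1, "chain_length": 2},
}
example_weights = {"cyclicity": 0.5, "chain_length": 0.5}

for network_id, score in total_risk_scores(example_networks, example_weights).items():
    print(network_id, round(score, 3))
```

Normalising each signature by its own maximum keeps the slider weights comparable across signatures with different raw scales; other choices (ranks, z-scores) would work too and change how outliers are treated.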
  • ■ ■ ■ ■ ■ ■
    requirements.txt
    1  -elementpath
    2  -csv
    3  -urllib
    4  -requests==2.22.0
     1 +elementpath==4.8.6
    5 2  beautifulsoup4==4.8.1
     3 +altair==4.2.0
     4 +appnope==0.1.3
     5 +asttokens==2.0.8
     6 +attrs==22.1.0
     7 +aws-mfa==0.0.12
     8 +backcall==0.2.0
     9 +blinker==1.5
     10 +boto3==1.24.80
     11 +botocore==1.27.80
     12 +cachetools==5.2.0
     13 +certifi==2022.9.14
     14 +charset-normalizer==2.1.1
     15 +click==8.1.3
     16 +commonmark==0.9.1
     17 +debugpy==1.6.3
     18 +decorator==5.1.1
     19 +entrypoints==0.4
     20 +executing==1.0.0
     21 +gitdb==4.0.9
     22 +GitPython==3.1.27
     23 +idna==3.4
     24 +importlib-metadata==4.12.0
     25 +ipykernel==6.15.3
     26 +ipython==8.5.0
     27 +isodate==0.6.1
     28 +jedi==0.18.1
     29 +Jinja2==3.1.2
     30 +jmespath==1.0.1
     31 +jsonschema==4.16.0
     32 +jupyter-core==4.11.1
     33 +jupyter_client==7.3.5
     34 +MarkupSafe==2.1.1
     35 +matplotlib-inline==0.1.6
     36 +nest-asyncio==1.5.5
     37 +networkx==2.8.6
     38 +numpy==1.23.3
     39 +packaging==21.3
     40 +pandas==1.5.0
     41 +parso==0.8.3
     42 +pexpect==4.8.0
     43 +pickleshare==0.7.5
     44 +Pillow==9.2.0
     45 +polars==0.14.13
     46 +prompt-toolkit==3.0.31
     47 +protobuf==3.20.1
     48 +psutil==5.9.2
     49 +ptyprocess==0.7.0
     50 +pure-eval==0.2.2
     51 +pyarrow==9.0.0
     52 +pydeck==0.8.0b3
     53 +Pygments==2.13.0
     54 +Pympler==1.0.1
     55 +pyparsing==3.0.9
     56 +pyrsistent==0.18.1
     57 +python-dateutil==2.8.2
     58 +pytz==2022.2.1
     59 +pytz-deprecation-shim==0.1.0.post0
     60 +pyzmq==24.0.1
     61 +rdflib==6.2.0
     62 +requests==2.28.1
     63 +rich==12.5.1
     64 +s3transfer==0.6.0
     65 +semver==2.13.0
     66 +six==1.16.0
     67 +smmap==5.0.0
     68 +stack-data==0.5.0
     69 +streamlit==1.13.0
     70 +streamlit-agraph==0.0.42
     71 +toml==0.10.2
     72 +toolz==0.12.0
     73 +tornado==6.2
     74 +traitlets==5.4.0
     75 +typing_extensions==4.3.0
     76 +tzdata==2022.2
     77 +tzlocal==4.2
     78 +urllib3==1.26.12
     79 +validators==0.20.0
     80 +wcwidth==0.2.5
     81 +zipp==3.8.1
     82 + 