STRLCPY/451-CorporateRiskMiner

Continued on README
Peter Zatka-Haas committed 2 years ago

5d285f06

1 parent ad223edc

.streamlit/config.toml

1 1 [theme]
2 2 base="light"
3 + primaryColor="#F63333"

README.md

		skipped 8 lines
9	9
10	10		## Tool Description
11	11
12		-	Financial crime journalists need to dig through complex corporate ownership databases (i.e. databases of companies and the people/companies that control those companies) in order to find potentially interesting people/companies related to financial crime. They face several problems along the way:
13		-	1. It is difficult to search across multiple publicly-available databases (UK Companies House, ICIJ Leaks, VK)
14		-	2. There are multiple ‘risk signatures’ associated with criminal activity (e.g. Cyclical or long-chain ownership, links to sanctions, etc) and different journalists prioritise different kinds of signatures in their investigation.
15		-	3. It is hard to prioritise which corporate ownership structures are more ‘risky’ than others
16		-	4. It is hard to see the visualise corporate ownership with different risk signals
	12	+	Financial crime journalists need to dig through complex corporate ownership databases (i.e. databases of companies and the people/companies that control those companies) in order to find associations to criminal activity. They face several problems along the way:
	13	+	1. It is difficult to search across multiple publicly-available databases (UK Companies House, Sanction lists, ICIJ Leaks, VK)
	14	+	2. There are multiple ‘risk signatures’ associated with criminal activity (e.g. Cyclical or long-chain ownership, links to sanctions, etc) and different journalists prioritise different kinds of signatures in their investigation
	15	+	3. The number of corporate networks is overwhelming, and so it is hard to prioritise which corporate ownership structures are more ‘risky’ than others
17	16
18		-	Corporate Risk Miner is a web app which evaluates different risk signatures of financial crime applied to the UK Companies House (UKCH) corporate ownership networks. These risk signatures include:
19		-	* Cyclic ownership: (to explain.....)
	17	+	451 Corporate Risk Miner allows a user to navigate over different corporate ownership networks extracted from UK Companies House (UKCH) to identify and visualise those exhibiting risk signatures associated with financial crime. Example risk signatures include:
	18	+	* Cyclic ownership: Circular company ownership (e.g. Company A owns Company B which owns Company C which owns Company A)
20	19		* Long-chain ownership: Long chains of corporate ownership (e.g. Person A controls company A. Company A is an officer for Company B. Company B is an officer of company C. etc)
21		-	* Links to tax havens: Corporate networks which involve companies/people associated with tax haven jurisdictions
22		-	* Multi-jurisdictionness: Corporate networsk which span many jurisdictions
23		-	* Presence of proxy directors: Proxy directors are individual people who are registered as a company director but who are likely never involved in the running of the business. These people are often directors for many companies.
	20	+	* Links to tax havens: Corporate networks which involve companies/people associated with tax haven or secrecy jurisdictions
	21	+	* Presence of proxy directors: Proxy directors are individual people who are registered as a company director on paper but who are likely never involved in the running of the business.
24	22		* Links to sanctioned entities: Official sanctioned people or companies, from sources such as the UN Sanctions List.
25	23		* Links to politically-exposed persons (PEPs)
26	24		* Links to disqualified directors
27	25
28		-	The user can customise the relative 'importance' of each risk signature for their search. For example one user may rate 'cyclic ownership' as a less important feature than 'association with tax havens' in flagging up potentially dodgy corporate networks. One the user chooses their signature preferences, the app generates a risk score associated with each corporate network and displays the structure of those networks with the highest risk scores.
	26	+	The user can customise the relative importance of each risk signature for their search. The app then computes a total risk score for each corporate network in UKCH, and outlines the details of the most high-risk networks. The user can export these network results as a .csv file for later viewing.
29	27
30	28		## Installation
31	29
		skipped 27 lines
59	57		- Any limitations of the current implementation of the tool
60	58		- Motivation for design/architecture decisions
61	59
	60	+	### Limitations
	61	+	* Limited to cliques of ??? hop distance owing to space limitation
	62	+	* Cyclicity calculation assumes an undirected graph to save computational time. This could be improved by taking into account specific directions of ownership.
	63	+	* Entity resolution for company/people entities could be improved
	64	+	* Graph visualisation for large corporate networks can be too cluttered to be useful.
	65	+
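The undirected-graph limitation listed above can be illustrated with a small sketch. The README does not say how cyclicity is computed, so the measure used here (the circuit rank, E - V + C) and the function name `circuit_rank` are assumptions chosen for illustration only.

```python
def circuit_rank(edges):
    """Count independent cycles in an undirected graph as E - V + C.

    `edges` is an iterable of (owner, owned) pairs. Direction is
    deliberately ignored, mirroring the limitation described above.
    """
    # Collapse directed links into undirected ones; drop self-loops.
    undirected = {frozenset((a, b)) for a, b in edges if a != b}
    nodes = {node for edge in undirected for node in edge}
    parent = {node: node for node in nodes}

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    components = len(nodes)
    for edge in undirected:
        a, b = edge
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            components -= 1

    return len(undirected) - len(nodes) + components


# A genuine ownership loop: A owns B, B owns C, C owns A.
loop = [("A", "B"), ("B", "C"), ("C", "A")]
# A "diamond": A owns B and C, which both own D. No directed cycle exists.
diamond = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
```

Under this measure both `loop` and `diamond` contain one cycle, even though only the first is circular ownership. That is the kind of false positive a direction-aware calculation would avoid.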
	66	+	### Potential next steps
	67	+	* Expand to corporate ownership databases outside of the UK, for example using OpenCorporates data.
	68	+	* Incorporate more external data sources identifying criminal or potentially-criminal activity for companies and people.
	69	+	*
	70	+
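The README text in this diff says that users weight each risk signature and the app computes a total risk score per network. A minimal sketch of one way to do that follows, assuming each signature is already a number in [0, 1] and each weight is a slider value divided by `SLIDER_MAX` (as in `app/app.py`). The weighted-mean formula and the names `risk_score` and `top_networks` are hypothetical; the commit does not show the actual scoring code.

```python
def risk_score(signals, weights):
    """Weighted mean of the risk signals for one network.

    signals: dict of signature name -> value in [0, 1]
    weights: dict of signature name -> relative importance in [0, 1]
    """
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # every slider set to zero: nothing counts as risky
    weighted = sum(weights[name] * signals.get(name, 0.0) for name in weights)
    return weighted / total_weight


def top_networks(networks, weights, n=3):
    """Return the n highest-scoring (network_id, score) pairs."""
    scored = [
        (network_id, risk_score(signals, weights))
        for network_id, signals in networks.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


networks = {
    "a": {"cyclicity": 1.0, "sanctions": 0.0},
    "b": {"cyclicity": 0.0, "sanctions": 1.0},
}
weights = {"cyclicity": 0.25, "sanctions": 0.75}
ranking = top_networks(networks, weights, n=1)
```

With these weights, network `b` ranks first because links to sanctions count three times as much as cyclic ownership.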

app/Untitled-1.ipynb

1	-	{
2	-	"cells": [
3	-	{
4	-	"cell_type": "code",
5	-	"execution_count": 1,
6	-	"metadata": {},
7	-	"outputs": [],
8	-	"source": [
9	-	"import pandas as pd\n",
10	-	"import polars as pl"
11	-	]
12	-	},
13	-	{
14	-	"cell_type": "code",
15	-	"execution_count": null,
16	-	"metadata": {},
17	-	"outputs": [],
18	-	"source": [
19	-	"aws s3 cp s3://ca-amt-df-playground-1-sagemaker-notebook-bucket-eu-west-1/hackathlon/nodes.parquet/ . --recursive "
20	-	]
21	-	},
22	-	{
23	-	"cell_type": "code",
24	-	"execution_count": 13,
25	-	"metadata": {},
26	-	"outputs": [
27	-	{
28	-	"data": {
29	-	"text/html": [
30	-	"<div>\n",
31	-	"<style scoped>\n",
32	-	" .dataframe tbody tr th:only-of-type {\n",
33	-	" vertical-align: middle;\n",
34	-	" }\n",
35	-	"\n",
36	-	" .dataframe tbody tr th {\n",
37	-	" vertical-align: top;\n",
38	-	" }\n",
39	-	"\n",
40	-	" .dataframe thead th {\n",
41	-	" text-align: right;\n",
42	-	" }\n",
43	-	"</style>\n",
44	-	"<table border=\"1\" class=\"dataframe\">\n",
45	-	" <thead>\n",
46	-	" <tr style=\"text-align: right;\">\n",
47	-	" <th></th>\n",
48	-	" <th>cyclicity</th>\n",
49	-	" <th>node_num</th>\n",
50	-	" </tr>\n",
51	-	" <tr>\n",
52	-	" <th>network_id</th>\n",
53	-	" <th></th>\n",
54	-	" <th></th>\n",
55	-	" </tr>\n",
56	-	" </thead>\n",
57	-	" <tbody>\n",
58	-	" <tr>\n",
59	-	" <th>28587319354</th>\n",
60	-	" <td>0.000000</td>\n",
61	-	" <td>9</td>\n",
62	-	" </tr>\n",
63	-	" <tr>\n",
64	-	" <th>19180640338</th>\n",
65	-	" <td>0.000000</td>\n",
66	-	" <td>7</td>\n",
67	-	" </tr>\n",
68	-	" <tr>\n",
69	-	" <th>29711988418</th>\n",
70	-	" <td>0.000000</td>\n",
71	-	" <td>9</td>\n",
72	-	" </tr>\n",
73	-	" <tr>\n",
74	-	" <th>30146913753</th>\n",
75	-	" <td>0.000000</td>\n",
76	-	" <td>17</td>\n",
77	-	" </tr>\n",
78	-	" <tr>\n",
79	-	" <th>41943095593</th>\n",
80	-	" <td>0.833333</td>\n",
81	-	" <td>18</td>\n",
82	-	" </tr>\n",
83	-	" <tr>\n",
84	-	" <th>...</th>\n",
85	-	" <td>...</td>\n",
86	-	" <td>...</td>\n",
87	-	" </tr>\n",
88	-	" <tr>\n",
89	-	" <th>10546100446</th>\n",
90	-	" <td>0.400000</td>\n",
91	-	" <td>5</td>\n",
92	-	" </tr>\n",
93	-	" <tr>\n",
94	-	" <th>12286972756</th>\n",
95	-	" <td>0.000000</td>\n",
96	-	" <td>7</td>\n",
97	-	" </tr>\n",
98	-	" <tr>\n",
99	-	" <th>20667544820</th>\n",
100	-	" <td>0.100000</td>\n",
101	-	" <td>10</td>\n",
102	-	" </tr>\n",
103	-	" <tr>\n",
104	-	" <th>6838088944</th>\n",
105	-	" <td>0.000000</td>\n",
106	-	" <td>3</td>\n",
107	-	" </tr>\n",
108	-	" <tr>\n",
109	-	" <th>22044908124</th>\n",
110	-	" <td>0.272727</td>\n",
111	-	" <td>11</td>\n",
112	-	" </tr>\n",
113	-	" </tbody>\n",
114	-	"</table>\n",
115	-	"<p>100000 rows × 2 columns</p>\n",
116	-	"</div>"
117	-	],
118	-	"text/plain": [
119	-	" cyclicity node_num\n",
120	-	"network_id \n",
121	-	"28587319354 0.000000 9\n",
122	-	"19180640338 0.000000 7\n",
123	-	"29711988418 0.000000 9\n",
124	-	"30146913753 0.000000 17\n",
125	-	"41943095593 0.833333 18\n",
126	-	"... ... ...\n",
127	-	"10546100446 0.400000 5\n",
128	-	"12286972756 0.000000 7\n",
129	-	"20667544820 0.100000 10\n",
130	-	"6838088944 0.000000 3\n",
131	-	"22044908124 0.272727 11\n",
132	-	"\n",
133	-	"[100000 rows x 2 columns]"
134	-	]
135	-	},
136	-	"execution_count": 13,
137	-	"metadata": {},
138	-	"output_type": "execute_result"
139	-	}
140	-	],
141	-	"source": [
142	-	"pd.read_parquet(\"./data/network.parquet\").set_index(\"network_id\")"
143	-	]
144	-	}
145	-	],
146	-	"metadata": {
147	-	"kernelspec": {
148	-	"display_name": "Python 3.10.1 ('.venv': venv)",
149	-	"language": "python",
150	-	"name": "python3"
151	-	},
152	-	"language_info": {
153	-	"codemirror_mode": {
154	-	"name": "ipython",
155	-	"version": 3
156	-	},
157	-	"file_extension": ".py",
158	-	"mimetype": "text/x-python",
159	-	"name": "python",
160	-	"nbconvert_exporter": "python",
161	-	"pygments_lexer": "ipython3",
162	-	"version": "3.10.1"
163	-	},
164	-	"orig_nbformat": 4,
165	-	"vscode": {
166	-	"interpreter": {
167	-	"hash": "7df5df506cb5a387f46ba54efbbd2d65ccf2196d092f81edeb09eadb2dc38463"
168	-	}
169	-	}
170	-	},
171	-	"nbformat": 4,
172	-	"nbformat_minor": 2
173	-	}
174	-

app/__pycache__/utils.cpython-310.pyc

Binary file.
app/__pycache__/utils.cpython-38.pyc

Binary file.

app/app.py

1	1		import json
	2	+	import pandas as pd
2	3		import streamlit as st
3	4		from streamlit_agraph import agraph, Config
4	5		from utils import (
		skipped 14 lines
19	20		SLIDER_DEFAULT = 50
20	21		DEFAULT_NUM_SUBGRAPHS_TO_SHOW = 3
21	22		GRAPH_PLOT_HEIGHT_PX = 400
22		-	GRAPH_SIZE_RENDER_LIMIT = 30
	23	+	GRAPH_SIZE_RENDER_LIMIT = 40
23	24		subgraphs = get_subgraph_df()
24	25
25	26		with st.sidebar:
26		-	st.title("Corporate risks")
	27	+	st.title("451 Corporate Risk Miner")
27	28
28	29		weight_chains = (
29	30		st.slider(
		skipped 1 lines
31	32		min_value=SLIDER_MIN,
32	33		max_value=SLIDER_MAX,
33	34		value=SLIDER_DEFAULT,
	35	+	disabled=True,
34	36		)
35	37		/ SLIDER_MAX
36	38		)
		skipped 12 lines
49	51		min_value=SLIDER_MIN,
50	52		max_value=SLIDER_MAX,
51	53		value=SLIDER_DEFAULT,
	54	+	disabled=True,
52	55		)
53	56		/ SLIDER_MAX
54	57		)
		skipped 3 lines
58	61		min_value=SLIDER_MIN,
59	62		max_value=SLIDER_MAX,
60	63		value=SLIDER_DEFAULT,
	64	+	disabled=True,
61	65		)
62	66		/ SLIDER_MAX
63	67		)
		skipped 3 lines
67	71		min_value=SLIDER_MIN,
68	72		max_value=SLIDER_MAX,
69	73		value=SLIDER_DEFAULT,
	74	+	disabled=True,
70	75		)
71	76		/ SLIDER_MAX
72	77		)
		skipped 3 lines
76	81		min_value=SLIDER_MIN,
77	82		max_value=SLIDER_MAX,
78	83		value=SLIDER_DEFAULT,
	84	+	disabled=True,
79	85		)
80	86		/ SLIDER_MAX
81	87		)
82		-	# custom_names_a = st.multiselect(
83		-	# label="Custom persons of interest",
84		-	# options=nodes["node_id"],
85		-	# default=None,
86		-	# )
87		-	custom_names_b = st.file_uploader(label="Custom persons of interest", type="csv")
	88	+
	89	+	custom_names = st.file_uploader(
	90	+	label="Custom persons/companies of interest", type="csv"
	91	+	)
	92	+
	93	+	if custom_names:
	94	+	custom_names = pd.read_csv(custom_names, header=None)[0].tolist()
	95	+	st.write(custom_names)
88	96
89	97		go = st.button("Go")
90	98
		skipped 42 lines
133	141		config=Config(
134	142		width=round(1080 / num_subgraphs_to_display),
135	143		height=GRAPH_PLOT_HEIGHT_PX,
	144	+	nodeHighlightBehavior=True,
	145	+	highlightColor="#F7A7A6",
	146	+	directed=True,
	147	+	collapsible=True,
136	148		),
137	149		)
138	150		else:
139	151		st.error("Subgraph is too large to render")
140	152
141		-	st.write(nodes_selected)
142		-	# # Build markdown strings for representing metadata
143		-	# markdown_strings = build_markdown_strings_for_node(nodes_selected)
	153	+	# Build markdown strings for representing metadata
	154	+	markdown_strings = build_markdown_strings_for_node(nodes_selected)
144	155
145		-	# st.markdown(":busts_in_silhouette: People")
146		-	# for p in markdown_strings["people"]:
147		-	# if ("SANCTIONED" in p) or ("PEP" in p):
148		-	# st.markdown(p)
149		-	# else:
150		-	# st.markdown(p)
	156	+	st.markdown(":busts_in_silhouette: People")
	157	+	for p in markdown_strings["people"]:
	158	+	st.markdown(p)
151	159
152		-	# st.markdown(":office: Companies")
153		-	# for c in markdown_strings["companies"]:
154		-	# if ("SANCTIONED" in c) or ("PEP" in c):
155		-	# st.markdown(c)
156		-	# else:
157		-	# st.markdown(c)
	160	+	st.markdown(":office: Companies")
	161	+	for c in markdown_strings["companies"]:
	162	+	st.markdown(c)
158	163
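The upload handling added in this file reads the first column of a headerless CSV into a list with `pd.read_csv(custom_names, header=None)[0].tolist()`. A dependency-free sketch of the same idea is shown below. It is not the commit's code: the helper name `read_names` is hypothetical, and it assumes the uploaded object is file-like and UTF-8 encoded. Unlike the pandas one-liner, it also strips whitespace and skips blank rows.

```python
import csv
import io


def read_names(uploaded_file):
    """Return the first column of a headerless CSV as a list of names."""
    raw = uploaded_file.read()
    # Accept both binary and text file-like objects.
    text = raw.decode("utf-8") if isinstance(raw, bytes) else raw

    names = []
    for row in csv.reader(io.StringIO(text)):
        # csv.reader yields an empty list for blank lines.
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


sample = io.BytesIO(b"ACME LTD\n\nJane Doe,extra column\n")
names = read_names(sample)
```

Only the first column is kept, so extra columns in the upload are ignored rather than raising an error.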

app/utils.py

		skipped 3 lines
4	4		import pandas as pd
5	5
6	6		NODE_COLOUR_NON_DODGY = "#72EF77"
7		-	NODE_COLOUR_DODGY = "#EF7272"
	7	+	NODE_COLOUR_DODGY = "#F63333"
8	8		NODE_IMAGE_PERSON = "http://i.ibb.co/LrY3tfw/747376.png" # https://www.flaticon.com/free-icon/user_747376
9	9		NODE_IMAGE_COMPANY = "http://i.ibb.co/fx6r1dZ/4812244.png" # https://www.flaticon.com/free-icon/company_4812244
10	10
		skipped 52 lines
63	63		node_objects.append(
64	64		Node(
65	65		id=row["node_id"],
66		-	label=row["node_id"].split("|")[0],
67		-	size=30,
	66	+	label="\n".join(row["node_id"].split("|")[0].split(" ")),
	67	+	size=20,
68	68		# color=NODE_COLOUR_DODGY
69	69		# if (row["pep"] > 0 or row["sanction"] > 0)
70	70		# else NODE_COLOUR_NON_DODGY,
71		-	# image=NODE_IMAGE_PERSON
	71	+	image=NODE_IMAGE_PERSON,
72	72		# if row["is_person"] == 1
73	73		# else NODE_IMAGE_COMPANY,
74		-	# shape="circularImage",
75		-	shape="circle",
	74	+	shape="circularImage",
76	75		)
77	76		)
78	77
		skipped 1 lines
80	79		edge_objects.append(
81	80		Edge(
82	81		source=row["source"],
83		-	label=row["type"],
	82	+	# label=row["type"][0],
84	83		target=row["target"],
85	84		)
86	85		)
		skipped 9 lines
96	95		markdown_strings["people"] = []
97	96
98	97		for _, row in nodes_selected.iterrows():
99		-	node_metadata = json.loads(row["node_metadata"])
100		-	node_sanctions = (
101		-	"" if row["sanction"] == 0 else f"! SANCTIONED: {row['sanction_metadata']}"
102		-	)
103		-	node_pep = "" if row["pep"] == 0 else f"! PEP: {row['pep_metadata']}"
	98	+	node_metadata = {
	99	+	"name": row["node_id"],
	100	+	"is_proxy": row["proxy_dir"],
	101	+	"is_person": True,
	102	+	}
104	103
105		-	if row["is_person"] == 1:
106		-	node_title = f"{node_metadata['name']} [{node_metadata['nationality']}/{node_metadata['yob']}/{node_metadata['mob']}]"
	104	+	# node_metadata = json.loads(row["node_metadata"])
	105	+	# node_sanctions = (
	106	+	# "" if row["sanction"] == 0 else f"! SANCTIONED: {row['sanction_metadata']}"
	107	+	# )
	108	+	# node_pep = "" if row["pep"] == 0 else f"! PEP: {row['pep_metadata']}"
	109	+
	110	+	node_sanctions = ""
	111	+	node_pep = ""
	112	+
	113	+	if node_metadata["is_person"]:
	114	+	# node_title = f"{node_metadata['name']} [{node_metadata['nationality']}/{node_metadata['yob']}/{node_metadata['mob']}]"
	115	+	node_title = f"{node_metadata['name']}"
107	116		key = "people"
108	117		else:
109		-	node_title = f"{node_metadata['name']} [{row['jur']}/{node_metadata['reg']}/{node_metadata['address']}]"
	118	+	# node_title = f"{node_metadata['name']} [{row['jur']}/{node_metadata['reg']}/{node_metadata['address']}]"
	119	+	node_title = f"{node_metadata['name']}"
110	120		key = "companies"
111	121
112	122		markdown_strings[key].append(
		skipped 7 lines
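The node label change in this file takes the part of `node_id` before a separator and puts each word on its own line, so that long names fit under a small circular node image. A standalone sketch of that transformation follows. The helper name `node_label` is hypothetical, and the pipe separator is an assumption about the `node_id` format.

```python
def node_label(node_id, separator="|"):
    """Build a multi-line graph label from a node identifier.

    Keeps only the text before the first separator, then places each
    word on its own line.
    """
    name = node_id.split(separator)[0]
    # str.split() with no argument also collapses repeated spaces,
    # which avoids empty lines in the rendered label.
    return "\n".join(name.split())


label = node_label("ACME HOLDINGS LTD|12345")
```

Splitting on any whitespace rather than on a single space is a small deviation from the diff, chosen so that double spaces in a name do not produce blank label lines.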
