STRLCPY/451-CorporateRiskMiner

■ ■ ■ ■ ■ ■

requirements.txt

1	1		elementpath
2	2		enum
3	3		csv
4		-
	4	+	requests==2.22.0
	5	+	beautifulsoup4==4.8.1

■ ■ ■ ■ ■ ■

sanctions/README.md

1	-	## Folder with Sanctions lists parsers
2	-
3	-	`source` directory has raw sanction lists (e.g. UN and OFAC)
4	-
5	-	`parsed` directory contains parsed data
6	-
7	-	### How to use
8	-
9	-	UN sanctions
10	-
11	-	```
12	-	python sanctions/un_parser.py -i "/sanctions/source/un.xml" -o "/sanctions/parsed/un_parsed.csv"
13	-	```
14	-	If this doesn't work, try absolute paths

■ ■ ■ ■ ■ ■

sanctions_and_peps/README.md

1	+	## Folder with Sanctions and PEP lists parsers
2	+
3	+	`source` directory has raw sanction lists (e.g. UN and OFAC)
4	+
5	+	`parsed` directory contains parsed data
6	+
7	+	### How to use
8	+
9	+	UN sanctions:
10	+
11	+	```
12	+	python sanctions_and_peps/un_parser.py -i "/sanctions_and_peps/source/un.xml" -o "/sanctions_and_peps/parsed/un_parsed.csv"
13	+	```
14	+
15	+	RU BL PEPs:
16	+
17	+	The data are scraped from
18	+
19	+	```
20	+	python sanctions_and_peps/ru_bl_peps_parser.py -o /sanctions_and_peps/parsed/ru_bl_peps_parsed.csv
21	+	```
22	+
23	+	PS: if this doesn't work, try absolute paths

sanctions_and_peps/parsed/ru_bl_peps_parsed.csv

Diff is too large to be displayed.

sanctions/parsed/un_parsed.csv sanctions_and_peps/parsed/un_parsed.csv

Content is identical

■ ■ ■ ■ ■ ■

sanctions_and_peps/ru_bl_peps_parser.py

1	+	import requests
2	+	import re
3	+	import csv
4	+	import argparse
5	+
6	+	from bs4 import BeautifulSoup
7	+
8	+
9	+	NAME_PARSER = re.compile(r"\((.*?)\)")
10	+
11	+
12	+	def parse_args():
13	+	    parser = argparse.ArgumentParser()
14	+	    parser.add_argument("-o", "--out", type=str, required=True)
15	+	    return parser.parse_args()
16	+
17	+
18	+	def parse_name(compound_name):
19	+	    found_name_en = NAME_PARSER.findall(compound_name)
20	+	    if found_name_en:
21	+	        name_en = found_name_en[0].strip()
22	+	    else:
23	+	        name_en = None
24	+	    name_ru = compound_name.split("(")[0].strip()
25	+	    return name_en, name_ru
26	+
27	+
28	+	def main():
29	+	    args = parse_args()
30	+
31	+	    url = "https://rupep.org/en/persons_list/"
32	+
33	+	    html_text = requests.get(url).text
34	+	    soup = BeautifulSoup(html_text, "html.parser")
35	+
36	+	    data = soup.find("table", "everything quicksilver_target").findAll("tr")
37	+
38	+	    header = [
39	+	        "NAME_EN",
40	+	        "NAME_RU",
41	+	        "DOB",
42	+	        "TAXPAYER_NUM",
43	+	        "CATEGORY",
44	+	        "LAST_POSITION_EN",
45	+	        "LAST_POSITION_RU",
46	+	    ]
47	+
48	+	    output = []
49	+	    for record in data[1:]:
50	+	        parsed_row = [None] * len(header)  # one slot per header column
51	+	        for i, column in enumerate(record.findAll("td")):
52	+	            column = column.text.strip()
53	+	            # parsed names
54	+	            if i == 0:
55	+	                name_en, name_ru = parse_name(column)
56	+	                parsed_row[0] = name_en
57	+	                parsed_row[1] = name_ru
58	+	            # DOB
59	+	            elif i == 1:
60	+	                if column:
61	+	                    parsed_row[2] = column
62	+	            # taxpayer num
63	+	            elif i == 2:
64	+	                if column:
65	+	                    parsed_row[3] = column
66	+	            # category
67	+	            elif i == 3:
68	+	                parsed_row[4] = column
69	+	            # last position, split into EN and RU names
70	+	            elif i == 4:
71	+	                if column:
72	+	                    name_en, name_ru = parse_name(column)
73	+	                    parsed_row[5] = name_en
74	+	                    parsed_row[6] = name_ru
75	+
76	+	        output.append(parsed_row)
77	+
78	+	    with open(args.out, "w", newline="", encoding="utf-8") as f:
79	+	        writer = csv.writer(f)
80	+	        writer.writerow(header)
81	+	        for row in output:
82	+	            writer.writerow(row)
83	+
84	+
85	+	if __name__ == "__main__":
86	+	    main()
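
Review note: the name splitting done by parse_name in the file above can be checked in isolation, without hitting the network. The following is a minimal sketch, assuming each cell has the form "Russian name (English name)" as the parser expects. The sample strings are hypothetical and are not taken from rupep.org.

```python
import re

# Same pattern as in ru_bl_peps_parser.py: lazily capture the text
# inside the first pair of parentheses.
NAME_PARSER = re.compile(r"\((.*?)\)")


def parse_name(compound_name):
    """Split 'RU name (EN name)' into (name_en, name_ru)."""
    found_name_en = NAME_PARSER.findall(compound_name)
    name_en = found_name_en[0].strip() if found_name_en else None
    # Everything before the first "(" is treated as the Russian name.
    name_ru = compound_name.split("(")[0].strip()
    return name_en, name_ru


# Hypothetical sample cells, for illustration only.
print(parse_name("Иван Иванов (Ivan Ivanov)"))  # ('Ivan Ivanov', 'Иван Иванов')
print(parse_name("Иван Иванов"))                # (None, 'Иван Иванов')
```

Because the pattern is lazy and only the first match is used, a cell with nested or multiple parentheses keeps just the text up to the first closing parenthesis.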

sanctions/source/un.xml sanctions_and_peps/source/un.xml

Content is identical

sanctions/un_parser.py sanctions_and_peps/un_parser.py

Content is identical

add ru and bl peps
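
Review note: the CSV written by ru_bl_peps_parser.py is only well formed if every row has as many fields as the header. One way to guard against a mismatch is sketched below, assuming the seven columns that the header list is meant to contain. The helper name write_rows and the sample row are hypothetical and are not part of this commit.

```python
import csv
import io

# Intended column layout of ru_bl_peps_parsed.csv (assumed: seven columns).
HEADER = [
    "NAME_EN",
    "NAME_RU",
    "DOB",
    "TAXPAYER_NUM",
    "CATEGORY",
    "LAST_POSITION_EN",
    "LAST_POSITION_RU",
]


def write_rows(handle, rows):
    """Write HEADER and rows, rejecting any row whose width differs."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(
                f"expected {len(HEADER)} fields, got {len(row)}: {row!r}"
            )
        writer.writerow(row)  # None values are written as empty fields


# Hypothetical sample row, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_rows(
    buffer,
    [["Ivan Ivanov", "Иван Иванов", None, None, "Sample category", None, None]],
)
print(buffer.getvalue())
```

When writing to a real file, opening it with newline="" and an explicit UTF-8 encoding avoids blank lines on Windows and encoding errors on Cyrillic names.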