1 | 1 | | import argparse |
| 2 | + | import csv |
2 | 3 | | import math |
3 | 4 | | import re |
4 | 5 | | import time |
| skipped 46 lines |
51 | 52 | | "deeplink", |
52 | 53 | | ] |
53 | 54 | | |
# Column names that the --fields option may select for CSV output.
available_csv_fields = [
    "engine",
    "name",
    "link",
    "domain"
    # TODO: add "description" -- requires modifying the scraping
    # (link_finder func) for all the engines first
]
| 62 | + | |
54 | 63 | | |
def print_epilog(fields=None, engines=None):
    """Build (not print -- name kept for compatibility) the --help epilog.

    Lists the CSV fields selectable via --fields and the supported
    search engines.

    Args:
        fields: iterable of CSV field names; defaults to the module-level
            available_csv_fields list.
        engines: iterable of engine names; defaults to the module-level
            supported_engines list.

    Returns:
        The formatted epilog string handed to argparse.
    """
    if fields is None:
        fields = available_csv_fields
    if engines is None:
        engines = supported_engines
    # str.join avoids the quadratic cost of repeated string concatenation.
    epilog = "Available CSV fields: \n\t"
    epilog += "".join(" {}".format(f) for f in fields)
    epilog += "\n"
    epilog += "Supported engines: \n\t"
    epilog += "".join(" {}".format(e) for e in engines)
    return epilog
60 | 73 | | |
61 | 74 | | |
62 | | - | parser = argparse.ArgumentParser(epilog=print_epilog()) |
| 75 | + | parser = argparse.ArgumentParser(epilog=print_epilog(), formatter_class=argparse.RawTextHelpFormatter) |
# NOTE(review): help text previously advertised 127.0.0.1:9050 while the
# actual default is 'localhost:9050'; keep the two in sync.
parser.add_argument("--proxy", default='localhost:9050', type=str, help="Set Tor proxy (default: localhost:9050)")
64 | 77 | | parser.add_argument("--output", default='output_$SEARCH_$DATE.txt', type=str, |
65 | 78 | | help="Output File (default: output_$SEARCH_$DATE.txt), where $SEARCH is replaced by the first " |
| skipped 5 lines |
parser.add_argument("--barmode", type=str, default="fixed", help="Can be 'fixed' (default) or 'unknown'")
# NOTE: action='append' combined with nargs='*' yields a list of lists
# (e.g. "--engines a b" -> [['a', 'b']]); downstream code reads index [0],
# so only the first occurrence of each option is honoured.
parser.add_argument("--engines", type=str, action='append', help='Engines to request (default: full list)', nargs="*")
parser.add_argument("--exclude", type=str, action='append', help='Engines to exclude (default: none)', nargs="*")
parser.add_argument("--fields", type=str, action='append',
                    help='Fields to output to csv file (default: engine name link), available fields are shown below',
                    nargs="*")
# Must be a single character; longer values are ignored and ',' is used
# (see the field_delim guard after parse_args).
parser.add_argument("--field_delimiter", type=str, default=",", help='Delimiter for the CSV fields')
74 | 91 | | |
args = parser.parse_args()

# requests proxy map; socks5h resolves hostnames through Tor itself.
proxies = {scheme: 'socks5h://{}'.format(args.proxy) for scheme in ('http', 'https')}
tqdm_bar_format = "{desc}: {percentage:3.0f}% |{bar}| {n_fmt:3s} / {total_fmt:3s} [{elapsed:5s} < {remaining:5s}]"
result = {}
filename = args.output

# csv.writer only accepts a one-character delimiter; anything longer (or
# empty) silently falls back to a comma.
field_delim = args.field_delimiter if args.field_delimiter and len(args.field_delimiter) == 1 else ","
81 | 100 | | |
82 | 101 | | def random_headers(): |
83 | 102 | | return {'User-Agent': choice(desktop_agents), |
| skipped 700 lines |
784 | 803 | | progress_bar.close() |
785 | 804 | | |
786 | 805 | | |
def get_domain_from_url(link):
    """Extract the host/domain component from an absolute URL.

    Args:
        link: URL string such as "http://user@example.onion/page".

    Returns:
        The host part of the URL (a reg-name or a bracketed IP-literal),
        or None when the string does not look like an absolute URL.
    """
    # scheme "://" [ userinfo "@" ] host  -- host is capture group 2.
    fqdn_re = r"^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])"
    # Scheme and host are case-insensitive per RFC 3986; the previous
    # lowercase-only match silently dropped otherwise valid links.
    match = re.match(fqdn_re, link, re.IGNORECASE)
    # Group 2 is mandatory whenever the pattern matches, so no extra
    # lastindex check is needed.
    return match.group(2) if match is not None else None
| 813 | + | |
| 814 | + | |
def write_to_csv(csv_writer, fields, requested=None):
    """Write a single search result as one CSV row.

    Args:
        csv_writer: a csv.writer-like object the row is written to.
        fields: dict holding the result values; expected keys are
            "engine", "name" and "link".
        requested: optional list of field names to emit, in order.  When
            None, the --fields command-line option is consulted; when that
            is absent too, the default columns engine, name, link are
            written.
    """
    if requested is None and args.fields:
        # argparse's action='append' + nargs='*' yields a list of lists;
        # only the first --fields occurrence is used.
        requested = args.fields[0]

    row = []
    if requested is not None:
        for field in requested:
            if field in fields:
                row.append(fields[field])
            if field == "domain":
                # "domain" is not scraped; it is derived from the link.
                row.append(get_domain_from_url(fields['link']))
    else:
        # Default output mode: engine, name, link.
        row = [fields['engine'], fields['name'], fields['link']]
    csv_writer.writerow(row)
| 831 | + | |
| 832 | + | |
787 | 833 | | def link_finder(engine_str, data_obj): |
788 | 834 | | global result |
789 | 835 | | global filename |
790 | 836 | | name = "" |
791 | 837 | | link = "" |
792 | | - | f = None |
| 838 | + | csv_file = None |
| 839 | + | has_result = False |
793 | 840 | | |
794 | 841 | | if args.continuous_write: |
795 | | - | f = open(filename, "a") |
| 842 | + | csv_file = open(filename, 'a', newline='') |
796 | 843 | | |
797 | 844 | | def append_link(): |
| 845 | + | nonlocal has_result |
| 846 | + | has_result = True |
| 847 | + | |
798 | 848 | | result[engine_str].append({"name": name, "link": link}) |
799 | | - | if args.continuous_write and f.writable(): |
800 | | - | f.write("\"{}\",\"{}\",\"{}\"\n".format(engine_str, name, link)) |
| 849 | + | |
| 850 | + | if args.continuous_write and csv_file.writable(): |
| 851 | + | csv_writer = csv.writer(csv_file, delimiter=field_delim, quoting=csv.QUOTE_NONNUMERIC) |
| 852 | + | fields = {"engine": engine_str, "name": name, "link": link} |
| 853 | + | write_to_csv(csv_writer, fields) |
801 | 854 | | |
802 | 855 | | if engine_str not in result: |
803 | 856 | | result[engine_str] = [] |
| skipped 7 lines |
811 | 864 | | append_link() |
812 | 865 | | |
813 | 866 | | if engine_str == "candle": |
814 | | - | for i in data_obj.find('html').find_all('a'): |
815 | | - | if str(i['href']).startswith("http"): |
816 | | - | name = clear(i.get_text()) |
817 | | - | link = clear(i['href']) |
818 | | - | append_link() |
| 867 | + | html_page = data_obj.find('html') |
| 868 | + | if html_page: |
| 869 | + | for i in data_obj.find('html').find_all('a'): |
| 870 | + | if str(i['href']).startswith("http"): |
| 871 | + | name = clear(i.get_text()) |
| 872 | + | link = clear(i['href']) |
| 873 | + | append_link() |
819 | 874 | | |
820 | 875 | | if engine_str == "darksearchenginer": |
821 | 876 | | for i in data_obj.find('div', attrs={"class": "table-responsive"}).find_all('a'): |
| skipped 153 lines |
975 | 1030 | | link = n.find('a')['href'] |
976 | 1031 | | append_link() |
977 | 1032 | | |
978 | | - | if args.continuous_write and not f.closed: |
979 | | - | f.close() |
| 1033 | + | if args.continuous_write and not csv_file.closed: |
| 1034 | + | csv_file.close() |
980 | 1035 | | |
981 | | - | if len(result[engine_str]) <= 0: |
| 1036 | + | if not has_result: |
982 | 1037 | | return -1 |
983 | 1038 | | |
984 | 1039 | | return 1 |
| skipped 6 lines |
991 | 1046 | | print("Error: unable to connect") |
992 | 1047 | | except OSError: |
993 | 1048 | | print("Error: unable to connect") |
994 | | - | |
995 | | - | |
996 | | - | def write_to_file(filename, results, engine): |
997 | | - | f = open(filename, "w+") |
998 | | - | for i in results[engine]: |
999 | | - | f.write("\"{}\",\"{}\",\"{}\"\n".format(engine, i["name"], i["link"])) |
1000 | | - | f.close() |
1001 | 1049 | | |
1002 | 1050 | | |
1003 | 1051 | | def scrape(): |
| skipped 25 lines |
1029 | 1077 | | stop_time = datetime.now() |
1030 | 1078 | | |
1031 | 1079 | | if not args.continuous_write: |
1032 | | - | f = open(filename, "w+") |
1033 | | - | for engine in result.keys(): |
1034 | | - | for i in result[engine]: |
1035 | | - | f.write("\"{}\",\"{}\",\"{}\"\n".format(engine, i["name"], i["link"])) |
1036 | | - | f.close() |
| 1080 | + | with open(filename, 'w', newline='') as csv_file: |
| 1081 | + | csv_writer = csv.writer(csv_file, delimiter=field_delim, quoting=csv.QUOTE_NONNUMERIC) |
| 1082 | + | for engine in result.keys(): |
| 1083 | + | for i in result[engine]: |
| 1084 | + | i['engine'] = engine |
| 1085 | + | write_to_csv(csv_writer, i) |
1037 | 1086 | | |
1038 | 1087 | | total = 0 |
1039 | 1088 | | print("\nReport:") |
| skipped 10 lines |