OnionSearch · commit 4951cda0
  • Default output filename now contains the datetime and search string (can be modified using the CLI --output param)

  • Gobarigo committed 4 years ago
    4951cda0
    1 parent 9aa506a5
  • ■ ■ ■ ■ ■ ■
    README.md
    skipped 2 lines
    3 3  [![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
    4 4   
    5 5  OnionSearch is a Python3 script that scrapes URLs from different ".onion" search engines.
     6 + 
    6 7  In 30 minutes, you can get thousands of unique URLs.
    7 8   
    8 9  ## 💡 Prerequisite
    skipped 37 lines
    46 47   
    47 48  ```
    48 49  usage: search.py [-h] [--proxy PROXY] [--output OUTPUT] [--limit LIMIT]
    49  -                 [--barmode BARMODE] [--engines [ENGINES [ENGINES ...]]]
    50  -                 [--exclude [EXCLUDE [EXCLUDE ...]]]
    51  -                 search
     50 +                 [--barmode BARMODE] [--engines [ENGINES [ENGINES ...]]]
     51 +                 [--exclude [EXCLUDE [EXCLUDE ...]]]
     52 +                 search
    52 53   
    53 54  positional arguments:
    54 55   search                The search string or phrase
    skipped 1 lines
    56 57  optional arguments:
    57 58   -h, --help            show this help message and exit
    58 59   --proxy PROXY         Set Tor proxy (default: 127.0.0.1:9050)
    59  -  --output OUTPUT       Output File (default: output.txt)
     60 +  --output OUTPUT       Output File (default: output_$SEARCH_$DATE.txt), where
     61 +                        $SEARCH is replaced by the first chars of the search
     62 +                        string and $DATE is replaced by the datetime
    60 63   --limit LIMIT         Set a max number of pages per engine to load
    61 64   --barmode BARMODE     Can be 'fixed' (default) or 'unknown'
    62 65   --engines [ENGINES [ENGINES ...]]
    63 66                         Engines to request (default: full list)
    64 67   --exclude [EXCLUDE [EXCLUDE ...]]
    65 68                         Engines to exclude (default: none)
     69 + 
     70 +[...]
    66 71  ```
    67 72   
    68 73  ### Examples
    69 74   
    70  -To request the string "computer" on all the engines to default file:
     75 +To search all the engines for the word "computer":
    71 76  ```
    72 77  python3 search.py "computer"
    73 78  ```
    74 79   
    75  -To request all the engines but "Ahmia" and "Candle":
     80 +To search all the engines except "Ahmia" and "Candle" for the word "computer":
    76 81  ```
    77  -python3 search.py "computer" --proxy 127.0.0.1:1337 --exclude ahmia candle
     82 +python3 search.py "computer" --exclude ahmia candle
    78 83  ```
    79 84   
    80  -To request only "Tor66", "DeepLink" and "Phobos":
     85 +To search only "Tor66", "DeepLink" and "Phobos" for the word "computer":
    81 86  ```
    82  -python3 search.py "computer" --proxy 127.0.0.1:1337 --engines tor66 deeplink phobos
     87 +python3 search.py "computer" --engines tor66 deeplink phobos
    83 88  ```
    84 89   
    85  -The same but limiting the number of page per engine to load to 3:
     90 +The same as above, but limiting the number of pages to load per engine to 3:
    86 91  ```
    87  -python3 search.py "computer" --proxy 127.0.0.1:1337 --engines tor66 deeplink phobos --limit 3
     92 +python3 search.py "computer" --engines tor66 deeplink phobos --limit 3
    88 93  ```
    89 94   
    90 95  Please note that the list of supported engines (and their keys) is given in the script help (-h).
    91 96   
     97 + 
    92 98  ### Output
    93 99   
    94 100  The file written at the end of the process will be a CSV file containing the following columns:
    skipped 1 lines
    96 102  "engine","name of the link","url"
    97 103  ```
    98 104   
    99  -The name and url strings are sanitized as much as possible, but there might still be some problems.
     105 +By default, the filename is set to `output_$SEARCH_$DATE.txt`, where $SEARCH represents the first
     106 +characters of the search string and $DATE the current datetime.
     107 + 
     108 +You can modify this filename by using `--output` when running the script, for instance:
     109 +```
     110 +python3 search.py "computer" --output "\$DATE.csv"
     111 +python3 search.py "computer" --output output.txt
     112 +python3 search.py "computer" --output "\$DATE_\$SEARCH.csv"
     113 +...
     114 +```
     115 +(Note that it might be necessary to escape the dollar character.)
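Alternatively, in POSIX shells, wrapping the value in single quotes prevents the shell from expanding the variables, so no escaping is needed:
```
python3 search.py "computer" --output '$DATE.csv'
```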
     116 + 
     117 +In the CSV file produced, the name and URL strings are sanitized as much as possible, but there may still be some problems.
    100 118   
    101 119   
    102 120  ## 📝 License
    skipped 4 lines
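As a minimal sketch of what the new `$DATE`/`$SEARCH` substitution does (mirroring the filename-building code added to search.py in this commit; the search string here is illustrative):

```python
from datetime import datetime

# Mirrors the filename-building logic added in this commit:
# $DATE -> current datetime, $SEARCH -> first characters of the search string.
output_template = "output_$SEARCH_$DATE.txt"  # the new default for --output
search_string = "computer science"            # hypothetical search phrase

filename = output_template.replace("$DATE", datetime.now().strftime("%Y%m%d%H%M%S"))
search = search_string.replace(" ", "")
if len(search) > 10:
    search = search[0:9]  # truncate long search strings, as the script does
filename = filename.replace("$SEARCH", search)

print(filename)  # e.g. output_computers_20200612094500.txt
```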
  • ■ ■ ■ ■ ■
    search.py
    skipped 60 lines
    61 61   
    62 62  parser = argparse.ArgumentParser(epilog=print_epilog())
    63 63  parser.add_argument("--proxy", default='localhost:9050', type=str, help="Set Tor proxy (default: 127.0.0.1:9050)")
    64  -parser.add_argument("--output", default='output.txt', type=str, help="Output File (default: output.txt)")
     64 +parser.add_argument("--output", default='output_$SEARCH_$DATE.txt', type=str,
     65 +                    help="Output File (default: output_$SEARCH_$DATE.txt), where $SEARCH is replaced by the first "
     66 +                         "chars of the search string and $DATE is replaced by the datetime")
    65 67  parser.add_argument("search", type=str, help="The search string or phrase")
    66 68  parser.add_argument("--limit", type=int, default=0, help="Set a max number of pages per engine to load")
    67 69  parser.add_argument("--barmode", type=str, default="fixed", help="Can be 'fixed' (default) or 'unknown'")
    skipped 359 lines
    427 429   
    428 430     with tqdm(total=1, initial=0, desc="%20s" % "Grams", unit="req", ascii=False, ncols=120,
    429 431               bar_format=tqdm_bar_format) as progress_bar:
    430  -
    431 432         resp = s.post(grams_url2, data={"req": searchstr, "_token": token})
    432 433         soup = BeautifulSoup(resp.text, 'html.parser')
    433 434         link_finder("grams", soup)
    skipped 551 lines
    985 986   
    986 987     start_time = datetime.now()
    987 988 
     989 +    # Building the filename
     990 +    filename = args.output
     991 +    filename = str(filename).replace("$DATE", start_time.strftime("%Y%m%d%H%M%S"))
     992 +    search = str(args.search).replace(" ", "")
     993 +    if len(search) > 10:
     994 +        search = search[0:9]
     995 +    filename = str(filename).replace("$SEARCH", search)
     996 + 
    988 997     if args.engines and len(args.engines) > 0:
    989 998         engines = args.engines[0]
    990 999         for e in engines:
    skipped 13 lines
    1004 1013     print("\nReport:")
    1005 1014     print(" Execution time: %s seconds" % (stop_time - start_time))
    1006 1015 
    1007  -    f = open(args.output, "w+")
     1016 +    f = open(filename, "w+")
    1008 1017     for engine in result.keys():
    1009 1018         print(" {}: {}".format(engine, str(len(result[engine]))))
    1010 1019         total += len(result[engine])
    skipped 1 lines
    1012 1021             f.write("\"{}\",\"{}\",\"{}\"\n".format(engine, i["name"], i["link"]))
    1013 1022   
    1014 1023     f.close()
    1015  -    print(" Total: {} links written to {}".format(str(total), args.output))
     1024 +    print(" Total: {} links written to {}".format(str(total), filename))
    1016 1025   
    1017 1026   
    1018 1027  if __name__ == "__main__":
    skipped 2 lines
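Side note on the sanitization caveat in the README: the script writes CSV rows by hand with f.write(), which is why stray quotes in names or URLs can still cause problems. A sketch of an alternative using Python's standard csv module, which quotes fields automatically (the shape of `result` is assumed from search.py above):

```python
import csv

# Hypothetical data in the same shape search.py accumulates:
# {engine: [{"name": ..., "link": ...}, ...]}
result = {"ahmia": [{"name": 'A "quoted" title', "link": "http://example.onion"}]}

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for engine, items in result.items():
        for i in items:
            writer.writerow([engine, i["name"], i["link"]])
```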