OnionSearch · commit 4951cda0
  • Default output filename now contains the datetime and search string (can be modified using the CLI --output param)

  • Gobarigo committed 4 years ago
    4951cda0
    1 parent 9aa506a5
  • ■ ■ ■ ■ ■ ■
    README.md
    skipped 2 lines
    3 3  [![forthebadge made-with-python](http://ForTheBadge.com/images/badges/made-with-python.svg)](https://www.python.org/)
    4 4   
    5 5  OnionSearch is a Python3 script that scrapes URLs from different ".onion" search engines.
     6 + 
    6 7  In 30 minutes, you can get thousands of unique URLs.
    7 8   
    8 9  ## 💡 Prerequisite
    skipped 37 lines
    46 47   
    47 48  ```
    48 49  usage: search.py [-h] [--proxy PROXY] [--output OUTPUT] [--limit LIMIT]
    49  -                 [--barmode BARMODE] [--engines [ENGINES [ENGINES ...]]]
    50  -                 [--exclude [EXCLUDE [EXCLUDE ...]]]
    51  -                 search
     50 +                 [--barmode BARMODE] [--engines [ENGINES [ENGINES ...]]]
     51 +                 [--exclude [EXCLUDE [EXCLUDE ...]]]
     52 +                 search
    52 53   
    53 54  positional arguments:
    54 55   search                The search string or phrase
    skipped 1 lines
    56 57  optional arguments:
    57 58   -h, --help            show this help message and exit
    58 59   --proxy PROXY         Set Tor proxy (default: 127.0.0.1:9050)
    59  -  --output OUTPUT       Output File (default: output.txt)
     60 +  --output OUTPUT       Output File (default: output_$SEARCH_$DATE.txt), where
     61 +                        $SEARCH is replaced by the first chars of the search
     62 +                        string and $DATE is replaced by the datetime
    60 63   --limit LIMIT         Set a max number of pages per engine to load
    61 64   --barmode BARMODE     Can be 'fixed' (default) or 'unknown'
    62 65   --engines [ENGINES [ENGINES ...]]
    63 66                         Engines to request (default: full list)
    64 67   --exclude [EXCLUDE [EXCLUDE ...]]
    65 68                         Engines to exclude (default: none)
     69 + 
     70 +[...]
    66 71  ```
    67 72   
    68 73  ### Examples
    69 74   
    70  -To request the string "computer" on all the engines to default file:
     75 +To search all the engines for the word "computer":
    71 76  ```
    72 77  python3 search.py "computer"
    73 78  ```
    74 79   
    75  -To request all the engines but "Ahmia" and "Candle":
     80 +To search all the engines except "Ahmia" and "Candle" for the word "computer":
    76 81  ```
    77  -python3 search.py "computer" --proxy 127.0.0.1:1337 --exclude ahmia candle
     82 +python3 search.py "computer" --exclude ahmia candle
    78 83  ```
    79 84   
    80  -To request only "Tor66", "DeepLink" and "Phobos":
     85 +To search only "Tor66", "DeepLink" and "Phobos" for the word "computer":
    81 86  ```
    82  -python3 search.py "computer" --proxy 127.0.0.1:1337 --engines tor66 deeplink phobos
     87 +python3 search.py "computer" --engines tor66 deeplink phobos
    83 88  ```
    84 89   
    85  -The same but limiting the number of page per engine to load to 3:
     90 +The same as above, but limiting the number of pages to load per engine to 3:
    86 91  ```
    87  -python3 search.py "computer" --proxy 127.0.0.1:1337 --engines tor66 deeplink phobos --limit 3
     92 +python3 search.py "computer" --engines tor66 deeplink phobos --limit 3
    88 93  ```
    89 94   
    90 95  Please note that the list of supported engines (and their keys) is given in the script help (-h).
    91 96   
     97 + 
    92 98  ### Output
    93 99   
    94 100  The file written at the end of the process will be a CSV file containing the following columns:
    skipped 1 lines
    96 102  "engine","name of the link","url"
    97 103  ```
    98 104   
    99  -The name and url strings are sanitized as much as possible, but there might still be some problems.
     105 +By default, the filename is set to `output_$SEARCH_$DATE.txt`, where $SEARCH represents the first
     106 +characters of the search string and $DATE the current datetime.
     107 + 
     108 +You can modify this filename by using `--output` when running the script, for instance:
     109 +```
     110 +python3 search.py "computer" --output "\$DATE.csv"
     111 +python3 search.py "computer" --output output.txt
     112 +python3 search.py "computer" --output "\$DATE_\$SEARCH.csv"
     113 +...
     114 +```
     115 +(Note that it might be necessary to escape the dollar character.)
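Alternatively, in POSIX shells, wrapping the value in single quotes prevents the shell from expanding the variables, so no escaping is needed:
```
python3 search.py "computer" --output '$DATE.csv'
```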
     116 + 
     117 +In the CSV file produced, the name and URL strings are sanitized as much as possible, but there may still be some problems.
    100 118   
    101 119   
    102 120  ## 📝 License
    skipped 4 lines
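As a minimal sketch of what the new `$DATE`/`$SEARCH` substitution does (mirroring the filename-building code added to search.py in this commit; the search string here is illustrative):

```python
from datetime import datetime

# Mirrors the filename-building logic added in this commit:
# $DATE -> current datetime, $SEARCH -> first characters of the search string.
output_template = "output_$SEARCH_$DATE.txt"  # the new default for --output
search_string = "computer science"            # hypothetical search phrase

filename = output_template.replace("$DATE", datetime.now().strftime("%Y%m%d%H%M%S"))
search = search_string.replace(" ", "")
if len(search) > 10:
    search = search[0:9]  # truncate long search strings, as the script does
filename = filename.replace("$SEARCH", search)

print(filename)  # e.g. output_computers_20200612094500.txt
```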
  • ■ ■ ■ ■ ■
    search.py
    skipped 60 lines
    61 61   
    62 62  parser = argparse.ArgumentParser(epilog=print_epilog())
    63 63  parser.add_argument("--proxy", default='localhost:9050', type=str, help="Set Tor proxy (default: 127.0.0.1:9050)")
    64  -parser.add_argument("--output", default='output.txt', type=str, help="Output File (default: output.txt)")
     64 +parser.add_argument("--output", default='output_$SEARCH_$DATE.txt', type=str,
     65 +                    help="Output File (default: output_$SEARCH_$DATE.txt), where $SEARCH is replaced by the first "
     66 +                         "chars of the search string and $DATE is replaced by the datetime")
    65 67  parser.add_argument("search", type=str, help="The search string or phrase")
    66 68  parser.add_argument("--limit", type=int, default=0, help="Set a max number of pages per engine to load")
    67 69  parser.add_argument("--barmode", type=str, default="fixed", help="Can be 'fixed' (default) or 'unknown'")
    skipped 359 lines
    427 429   
    428 430     with tqdm(total=1, initial=0, desc="%20s" % "Grams", unit="req", ascii=False, ncols=120,
    429 431               bar_format=tqdm_bar_format) as progress_bar:
    430  -
    431 432         resp = s.post(grams_url2, data={"req": searchstr, "_token": token})
    432 433         soup = BeautifulSoup(resp.text, 'html.parser')
    433 434         link_finder("grams", soup)
    skipped 551 lines
    985 986   
    986 987     start_time = datetime.now()
    987 988 
     989 +    # Building the filename
     990 +    filename = args.output
     991 +    filename = str(filename).replace("$DATE", start_time.strftime("%Y%m%d%H%M%S"))
     992 +    search = str(args.search).replace(" ", "")
     993 +    if len(search) > 10:
     994 +        search = search[0:9]
     995 +    filename = str(filename).replace("$SEARCH", search)
     996 + 
    988 997     if args.engines and len(args.engines) > 0:
    989 998         engines = args.engines[0]
    990 999         for e in engines:
    skipped 13 lines
    1004 1013     print("\nReport:")
    1005 1014     print(" Execution time: %s seconds" % (stop_time - start_time))
    1006 1015 
    1007  -    f = open(args.output, "w+")
     1016 +    f = open(filename, "w+")
    1008 1017     for engine in result.keys():
    1009 1018         print(" {}: {}".format(engine, str(len(result[engine]))))
    1010 1019         total += len(result[engine])
    skipped 1 lines
    1012 1021             f.write("\"{}\",\"{}\",\"{}\"\n".format(engine, i["name"], i["link"]))
    1013 1022   
    1014 1023     f.close()
    1015  -    print(" Total: {} links written to {}".format(str(total), args.output))
     1024 +    print(" Total: {} links written to {}".format(str(total), filename))
    1016 1025   
    1017 1026   
    1018 1027  if __name__ == "__main__":
    skipped 2 lines
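Side note on the sanitization caveat in the README: the script writes CSV rows by hand with f.write(), which is why stray quotes in names or URLs can still cause problems. A sketch of an alternative using Python's standard csv module, which quotes fields automatically (the shape of `result` is assumed from search.py above):

```python
import csv

# Hypothetical data in the same shape search.py accumulates:
# {engine: [{"name": ..., "link": ...}, ...]}
result = {"ahmia": [{"name": 'A "quoted" title', "link": "http://example.onion"}]}

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for engine, items in result.items():
        for i in items:
            writer.writerow([engine, i["name"], i["link"]])
```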