deduplicator: commit 31d35aae
    .github/workflows/build_and_release.yml
-name: Build and release
-
-on:
-  push:
-    branches:
-      - main
-  release:
-    types: [created]
-
-jobs:
-  build:
-    runs-on: ubuntu-latest
-    steps:
-      - name: Checkout code
-        uses: actions/checkout@v2
-
-      - name: Setup Rust
-        uses: actions-rs/toolchain@v1
-        with:
-          toolchain: stable
-          profile: minimal
-          override: true
-
-      - name: Build for Windows
-        run: |
-          cargo build --release --target x86_64-pc-windows-gnu
-
-      - name: Build for Linux
-        run: |
-          cargo build --release --target x86_64-unknown-linux-gnu
-
-      - name: Build for MacOS
-        run: |
-          cargo build --release --target x86_64-apple-darwin
-
-      - name: Create release
-        uses: actions/create-release@v2
-        env:
-          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
-        with:
-          tag_name: ${{ github.ref }}
-          release_name: Release ${{ github.ref }}
-          draft: false
-          prerelease: false
-
    .github/workflows/release.yml
+name: Release
+
+env:
+  PROJECT_NAME: deduplicator
+  PROJECT_DESC: "Filter, Sort & Delete Duplicate Files Recursively"
+  PROJECT_AUTH: "sreedevk"
+
+on:
+  release:
+    types:
+      - created
+
+jobs:
+  upload-assets:
+    strategy:
+      matrix:
+        os:
+          - ubuntu-latest
+          - macos-latest
+          - windows-latest
+    runs-on: ${{ matrix.os }}
+    steps:
+      - uses: actions/checkout@v3
+      - uses: taiki-e/upload-rust-binary-action@v1
+        with:
+          bin: deduplicator
+          tar: unix
+          zip: windows
+          token: ${{ secrets.GITHUB_TOKEN }}
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
    .gitignore
 /target
 /test_data
+.envrc
 
    README.md
    skipped 3 lines
   Find, Sort, Filter & Delete duplicate files
 </p>
 
-<p align="center">
-NOTE: This project is still being developed. At the moment, as shown in the screenshot below, deduplicator is able to scan through and list duplicates with and without caching. Contributions are welcome.
-</p>
-
-<h2 align="center">Usage</h2>
+## Usage
 
 ```bash
 Usage: deduplicator [OPTIONS]
    skipped 7 lines
   -V, --version Print version information
 ```
 
-<h2 align="center">Installation</h2>
+## Installation
 
-<p align="center">Currently, deduplicator is only installable via rust's cargo package manager</p>
+### Cargo Install
+
+#### Stable
 
+```bash
+$ cargo install deduplicator
 ```
-cargo install deduplicator
+
+#### Nightly
+
+If you'd like to install deduplicator with nightly features, you can use:
+
+```bash
+$ cargo install --git https://github.com/sreedevk/deduplicator
 ```
-<p align="center">
- note that if you use a version manager to install rust (like asdf), you need to reshim (`asdf reshim rust`).
-</p>
+Please note that if you use a version manager to install Rust (like asdf), you need to reshim (`asdf reshim rust`).
 
-<h2 align="center">Performance</h2>
+### Linux (Pre-built Binary)
 
-<p align="center">
- Deduplicator uses fxhash (a non-cryptographic hashing algorithm) which is extremely fast. As a result, deduplicator is able to process huge amounts of data in a <del>couple of seconds.</del> few milliseconds.</p>
+You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page:
+grab `deduplicator-x86_64-unknown-linux-gnu.tar.gz` for Linux. Once you have the tarball with the executable,
+follow these steps to install:
 
-<p align="center">
- <del>While testing, Deduplicator was able to go through 8.6GB of pdf files and detect duplicates in 2.9 seconds</del>
- As of version 0.1.1, on testing locally, deduplicator was able to process and find duplicates in 120GB of files (Videos, PDFs, Images) in ~300ms
-</p>
+```bash
+$ tar -zxvf deduplicator-x86_64-unknown-linux-gnu.tar.gz
+$ sudo mv deduplicator /usr/bin/
+```
 
-<h2 align="center">Screenshots</h2>
+### macOS (Pre-built Binary)
 
-<img src="https://user-images.githubusercontent.com/36154121/213618143-e5182e39-731e-4817-87dd-1a6a0f38a449.gif" />
+You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page:
+grab the `deduplicator-x86_64-apple-darwin.tar.gz` tarball for macOS. Once you have the tarball with the executable, follow these steps to install:
+
+```bash
+$ tar -zxvf deduplicator-x86_64-apple-darwin.tar.gz
+$ sudo mv deduplicator /usr/local/bin/
+```
+
+### Windows (Pre-built Binary)
+
+You can download the pre-built binary from the [Releases](https://github.com/sreedevk/deduplicator/releases) page:
+grab the `deduplicator-x86_64-pc-windows-msvc.zip` file for Windows. Unzip it and move `deduplicator.exe` to a location on the `PATH` system environment variable.
+
+Note: if you run into an MSVC error, please install the MSVC redistributable from [here](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170).
+
+## Performance
+
+Deduplicator uses size comparison and fxhash (a fast non-cryptographic hashing algorithm) to quickly scan through large numbers of files and find duplicates. It is also highly parallel (it uses rayon and dashmap). In local testing, it processed 120GB of files (videos, PDFs, images) in ~300ms. Check out the benchmarks below.
+
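The size-then-hash strategy described above can be sketched in a few lines of Rust. This is a minimal illustration, not deduplicator's actual code: std's `HashMap` and `DefaultHasher` stand in for the dashmap and fxhash crates the real tool uses, and files are modeled as in-memory byte slices.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::Hasher;

// Group candidates by size first; only same-sized files can be duplicates,
// so files with a unique size are never hashed at all.
fn find_duplicates(files: &[(&str, &[u8])]) -> Vec<Vec<String>> {
    // Pass 1: bucket file indices by size.
    let mut by_size: HashMap<usize, Vec<usize>> = HashMap::new();
    for (i, (_, data)) in files.iter().enumerate() {
        by_size.entry(data.len()).or_default().push(i);
    }

    // Pass 2: hash only buckets with more than one member, grouping by digest.
    let mut by_hash: HashMap<u64, Vec<String>> = HashMap::new();
    for idxs in by_size.values().filter(|v| v.len() > 1) {
        for &i in idxs {
            let (name, data) = files[i];
            let mut h = DefaultHasher::new();
            h.write(data);
            by_hash.entry(h.finish()).or_default().push(name.to_string());
        }
    }

    // Only hash-groups with 2+ entries are duplicate sets.
    by_hash.into_values().filter(|g| g.len() > 1).collect()
}

fn main() {
    let files: Vec<(&str, &[u8])> = vec![
        ("a.txt", b"hello"),
        ("b.txt", b"hello"),   // duplicate of a.txt
        ("c.txt", b"world"),   // same size, different content
        ("d.txt", b"unique!"), // unique size, never hashed
    ];
    for group in find_duplicates(&files) {
        println!("duplicates: {:?}", group);
    }
}
```

In the real tool, the second pass is where rayon's parallel iterators and dashmap's concurrent map pay off, since hashing file contents dominates the runtime.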
+## Benchmarks
+
+| Command | Dir size | Mean [ms] | Min [ms] | Max [ms] | Relative |
+|:---|:---|---:|---:|---:|---:|
+| `deduplicator --dir ~/Data/tmp` | (~120G) | 27.5 ± 1.0 | 26.0 | 32.1 | 1.70 ± 0.09 |
+| `deduplicator --dir ~/Data/books` | (~8.6G) | 21.8 ± 0.7 | 20.5 | 24.4 | 1.35 ± 0.07 |
+| `deduplicator --dir ~/Data/books --minsize 10M` | (~8.6G) | 16.1 ± 0.6 | 14.9 | 18.8 | 1.00 |
+| `deduplicator --dir ~/Data/ --types pdf,jpg,png,jpeg` | (~290G) | 1857.4 ± 24.5 | 1817.0 | 1895.5 | 115.07 ± 4.64 |
+
+* The last entry is slower because of the number of files deduplicator had to go through (~660,895 files); the average size of the files rarely affects deduplicator's performance.
+
+These benchmarks were run using [hyperfine](https://github.com/sharkdp/hyperfine). Here are the specs of the machine used to benchmark deduplicator:
+
+```
+OS: Arch Linux x86_64
+Host: Precision 5540
+Kernel: 5.15.89-1-lts
+Uptime: 4 hours, 44 mins
+Shell: zsh 5.9
+Terminal: kitty
+CPU: Intel i9-9880H (16) @ 4.800GHz
+GPU: NVIDIA Quadro T2000 Mobile / Max-Q
+GPU: Intel CoffeeLake-H GT2 [UHD Graphics 630]
+Memory: 31731MiB (~32GiB)
+```
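For reference, a hyperfine run over the commands in the table above might look something like the following sketch. The exact flags used for these numbers are not recorded in the source; the `--warmup` count here is an assumption, and the directories are the author's own paths.

```shell
# Illustrative hyperfine invocation; adjust the directories to local paths.
# --warmup runs each command a few times first so filesystem caches are hot.
hyperfine --warmup 3 \
  'deduplicator --dir ~/Data/books' \
  'deduplicator --dir ~/Data/books --minsize 10M'
```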
+
+## Screenshots
+
+![](https://user-images.githubusercontent.com/36154121/213618143-e5182e39-731e-4817-87dd-1a6a0f38a449.gif)
 