Speed up cataloging by replacing glob searching with index lookups (#1510)
* replace raw globs with index equivalent operations
Signed-off-by: Alex Goodman <[email protected]>
* add cataloger test for alpm cataloger
Signed-off-by: Alex Goodman <[email protected]>
* fix import sorting for binary cataloger
Signed-off-by: Alex Goodman <[email protected]>
* fix linting for mock resolver
Signed-off-by: Alex Goodman <[email protected]>
* separate portage cataloger parser impl from cataloger
Signed-off-by: Alex Goodman <[email protected]>
* enhance cataloger pkgtest utils to account for resolver responses
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for alpm cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for apkdb cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for dpkg cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for cpp cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for dart cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for dotnet cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for elixir cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for erlang cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for golang cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for haskell cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for java cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for javascript cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for php cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for portage cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for python cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for rpm cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for rust cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for sbom cataloger
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for swift cataloger
Signed-off-by: Alex Goodman <[email protected]>
* allow generic cataloger to run all mimetype searches at once
Signed-off-by: Alex Goodman <[email protected]>
* remove stutter from php and javascript cataloger constructors
Signed-off-by: Alex Goodman <[email protected]>
* bump stereoscope
Signed-off-by: Alex Goodman <[email protected]>
* add tests for generic.Search
Signed-off-by: Alex Goodman <[email protected]>
* add exceptions for java archive git ignore entries
Signed-off-by: Alex Goodman <[email protected]>
* enhance basename and extension resolver methods to be variadic
Signed-off-by: Alex Goodman <[email protected]>
* don't allow * prefix on extension searches
Signed-off-by: Alex Goodman <[email protected]>
* add glob-based cataloger tests for ruby cataloger
Signed-off-by: Alex Goodman <[email protected]>
* remove unnecessary string casting
Signed-off-by: Alex Goodman <[email protected]>
* incorporate surfacing of leaf link resolutions from stereoscope results
Signed-off-by: Alex Goodman <[email protected]>
* [wip] switch to stereoscope file metadata
Signed-off-by: Alex Goodman <[email protected]>
* [wip + failing] revert to old globs but keep new resolvers
Signed-off-by: Alex Goodman <[email protected]>
* index files, links, and dirs within the directory resolver
Signed-off-by: Alex Goodman <[email protected]>
* fix several resolver bugs and inconsistencies
Signed-off-by: Alex Goodman <[email protected]>
* move format testutils to internal package
Signed-off-by: Alex Goodman <[email protected]>
* update syft json to account for file type string normalization
Signed-off-by: Alex Goodman <[email protected]>
* split up directory resolver from indexing
Signed-off-by: Alex Goodman <[email protected]>
* update docs to include details about searching
Signed-off-by: Alex Goodman <[email protected]>
* [wip] bump stereoscope to development version
Signed-off-by: Alex Goodman <[email protected]>
* fix linting
Signed-off-by: Alex Goodman <[email protected]>
* adjust symlinks fixture to be fixed to digest
Signed-off-by: Alex Goodman <[email protected]>
* fix all-locations resolver tests
Signed-off-by: Alex Goodman <[email protected]>
* fix test fixture reference
Signed-off-by: Alex Goodman <[email protected]>
* rename file.Type
Signed-off-by: Alex Goodman <[email protected]>
* bump stereoscope
Signed-off-by: Alex Goodman <[email protected]>
* fix PR comment to exclude extra *
Signed-off-by: Alex Goodman <[email protected]>
* bump to dev version of stereoscope
Signed-off-by: Alex Goodman <[email protected]>
* bump to final version of stereoscope
Signed-off-by: Alex Goodman <[email protected]>
* move observing resolver to pkgtest
Signed-off-by: Alex Goodman <[email protected]>
---------
Signed-off-by: Alex Goodman <[email protected]>
Catalogers are the way in which syft is able to identify and construct packages given some amount of source metadata.
For example, Syft can locate and process `package-lock.json` files when performing filesystem scans.
-See: [how to specify file globs](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/javascript/cataloger.go#L16-L21)
-and an implementation of the [package-lock.json parser](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/javascript/cataloger.go#L16-L21) for a quick review.
+See: [how to specify file globs](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/javascript/cataloger.go#L16-L21)
+and an implementation of the [package-lock.json parser](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/javascript/cataloger.go#L16-L21) for a quick review.

#### Building a new Cataloger

-Catalogers must fulfill the interface [found here](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger.go).
+Catalogers must fulfill the interface [found here](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger.go).
This means that when building a new cataloger, the new struct must implement both method signatures of `Catalog` and `Name`.
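
As a rough illustration of the two-method contract, and of a caller that ranges over catalogers and invokes `Catalog`, here is a minimal, self-contained sketch. The `Package`, `FileResolver`, and `Cataloger` types below are simplified stand-ins written for this example, not the real definitions in syft (which, among other differences, also return relationships), and `lockfileCataloger`, `staticResolver`, and `catalogAll` are hypothetical names. The sketch stops at the first error; the real code may handle errors differently.

```go
package main

import "fmt"

// Package is a simplified stand-in for syft's package struct (assumption: the real one has more fields).
type Package struct {
	Name    string
	Version string
}

// FileResolver is a pared-down, hypothetical stand-in for the resolver handed to catalogers.
type FileResolver interface {
	FilesByGlob(patterns ...string) ([]string, error)
}

// Cataloger mirrors the two-method shape described above: Name and Catalog.
type Cataloger interface {
	Name() string
	Catalog(resolver FileResolver) ([]Package, error)
}

// staticResolver is a toy resolver that always reports the same paths.
type staticResolver struct{ paths []string }

func (r staticResolver) FilesByGlob(_ ...string) ([]string, error) { return r.paths, nil }

// lockfileCataloger is a hypothetical cataloger that emits one package per discovered file.
type lockfileCataloger struct{}

func (lockfileCataloger) Name() string { return "example-lockfile-cataloger" }

func (lockfileCataloger) Catalog(resolver FileResolver) ([]Package, error) {
	paths, err := resolver.FilesByGlob("**/package-lock.json")
	if err != nil {
		return nil, fmt.Errorf("unable to search for lock files: %w", err)
	}
	var pkgs []Package
	for _, p := range paths {
		// A real cataloger would parse the file; here the path stands in for the package name.
		pkgs = append(pkgs, Package{Name: p, Version: "unknown"})
	}
	return pkgs, nil
}

// catalogAll ranges over every cataloger, invokes Catalog, and collects the results.
func catalogAll(resolver FileResolver, catalogers ...Cataloger) ([]Package, error) {
	var all []Package
	for _, c := range catalogers {
		pkgs, err := c.Catalog(resolver)
		if err != nil {
			return nil, fmt.Errorf("cataloger %q failed: %w", c.Name(), err)
		}
		all = append(all, pkgs...)
	}
	return all, nil
}

func main() {
	resolver := staticResolver{paths: []string{"/app/package-lock.json"}}
	pkgs, err := catalogAll(resolver, lockfileCataloger{})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pkgs), pkgs[0].Name)
}
```

Plugging in a new cataloger then amounts to adding another value that satisfies the interface to the list passed to the collecting function.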

-A top level view of the functions that construct all the catalogers can be found [here](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/cataloger.go).
+A top level view of the functions that construct all the catalogers can be found [here](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/cataloger.go).
When an author has finished writing a new cataloger this is the spot to plug in the new catalog constructor.

-For a top level view of how the catalogers are used see [this function](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/catalog.go#L41-L100) as a reference. It ranges over all catalogers passed as an argument and invokes the `Catalog` method:
+For a top level view of how the catalogers are used see [this function](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/catalog.go#L41-L100) as a reference. It ranges over all catalogers passed as an argument and invokes the `Catalog` method:

Each cataloger has its own `Catalog` method, but this does not mean that they are all vastly different.
-Take a look at the `apkdb` cataloger for alpine to see how it [constructs a generic.NewCataloger](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/apkdb/cataloger.go).
+Take a look at the `apkdb` cataloger for alpine to see how it [constructs a generic.NewCataloger](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/apkdb/cataloger.go).

`generic.NewCataloger` is an abstraction syft uses to make writing common components easier. First, it takes the `catalogerName` to identify the cataloger.
On the other side of the call it uses two key pieces which inform the cataloger how to identify and return packages, the `globPatterns` and the `parseFunction`:
- The first piece is a `parseByGlob` matching pattern used to identify the files that contain the package metadata.
-See [here for the APK example](https://github.com/anchore/syft/blob/main/syft/pkg/apk_metadata.go#L16-L41).
+See [here for the APK example](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/apk_metadata.go#L16-L41).
- The other is a `parseFunction` which informs the cataloger what to do when it has found one of the above matched files.
-See this [link for an example](https://github.com/anchore/syft/blob/main/syft/pkg/cataloger/apkdb/parse_apk_db.go#L22-L102).
+See this [link for an example](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/cataloger/apkdb/parse_apk_db.go#L22-L102).

If you're unsure about using the `Generic Cataloger` and think the use case being filled requires something more custom
just file an issue or ask in our slack, and we'd be more than happy to help on the design.
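
To make the generic cataloger idea described above more concrete, the following is a minimal sketch of a cataloger built from a name, a set of glob patterns, and a parse function. It does not use syft's `generic` package; `genericCataloger`, `parseFunc`, and `parseNameVersionLines` are hypothetical names, the in-memory `files` map stands in for a real file resolver, and matching is done on the file's base name with the standard library's `path.Match` (which does not support `**`).

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// Package is a simplified stand-in for syft's package struct.
type Package struct {
	Name, Version string
}

// parseFunc turns the contents of one matched file into packages (hypothetical signature).
type parseFunc func(filePath, contents string) ([]Package, error)

// genericCataloger pairs a name with glob patterns and a parser.
type genericCataloger struct {
	name         string
	globPatterns []string
	parse        parseFunc
}

func (c genericCataloger) Name() string { return c.name }

// matches reports whether the base name of filePath matches any configured glob.
func (c genericCataloger) matches(filePath string) bool {
	for _, pattern := range c.globPatterns {
		if ok, err := path.Match(pattern, path.Base(filePath)); err == nil && ok {
			return true
		}
	}
	return false
}

// Catalog runs the parser over every matching file; files maps path to contents.
func (c genericCataloger) Catalog(files map[string]string) ([]Package, error) {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // visit files in a stable order

	var pkgs []Package
	for _, p := range paths {
		if !c.matches(p) {
			continue
		}
		found, err := c.parse(p, files[p])
		if err != nil {
			return nil, fmt.Errorf("%s: unable to parse %q: %w", c.name, p, err)
		}
		pkgs = append(pkgs, found...)
	}
	return pkgs, nil
}

// parseNameVersionLines reads "name version" pairs, one per line, skipping anything else.
func parseNameVersionLines(_ string, contents string) ([]Package, error) {
	var pkgs []Package
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		pkgs = append(pkgs, Package{Name: fields[0], Version: fields[1]})
	}
	return pkgs, nil
}

func main() {
	c := genericCataloger{
		name:         "example-db-cataloger",
		globPatterns: []string{"installed"},
		parse:        parseNameVersionLines,
	}
	files := map[string]string{
		"/lib/apk/db/installed": "musl 1.2.3\nbusybox 1.35.0",
		"/etc/hostname":         "box",
	}
	pkgs, err := c.Catalog(files)
	if err != nil {
		panic(err)
	}
	fmt.Println(c.Name(), pkgs)
}
```

The point of the split is that the cataloger itself stays generic: only the patterns and the parser change from one ecosystem to the next.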

-Identified packages share a common struct so be sure that when the new cataloger is constructing a new package it is using the [`Package` struct](https://github.com/anchore/syft/blob/main/syft/pkg/package.go#L16-L31).
+Identified packages share a common struct so be sure that when the new cataloger is constructing a new package it is using the [`Package` struct](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg/package.go#L16-L31).

Metadata Note: Identified packages are also assigned specific metadata that can be unique to their environment.
-See [this folder](https://github.com/anchore/syft/tree/main/syft/pkg) for examples of the different metadata types.
+See [this folder](https://github.com/anchore/syft/tree/v0.70.0/syft/pkg) for examples of the different metadata types.
These are plugged into the `MetadataType` and `Metadata` fields in the above struct. `MetadataType` informs which type is being used. `Metadata` is an interface converted to that type.
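
The relationship between the two fields can be sketched as follows. This is an illustrative, self-contained example rather than syft's actual definitions: the `Package` struct is cut down to three fields, and the `ApkMetadataType` value, the `ApkMetadata` fields, and the `describe` helper are assumptions made for the sketch.

```go
package main

import "fmt"

// MetadataType names which concrete type is stored in Package.Metadata.
type MetadataType string

// ApkMetadataType is an illustrative value; the real constants live in syft's pkg package.
const ApkMetadataType MetadataType = "ApkMetadata"

// ApkMetadata is a cut-down, hypothetical metadata struct.
type ApkMetadata struct {
	OriginPackage string
}

// Package is a simplified stand-in showing only the fields discussed above.
type Package struct {
	Name         string
	MetadataType MetadataType
	Metadata     interface{}
}

// describe uses MetadataType to decide which concrete type to assert Metadata to.
func describe(p Package) string {
	switch p.MetadataType {
	case ApkMetadataType:
		m, ok := p.Metadata.(ApkMetadata)
		if !ok {
			return fmt.Sprintf("%s: metadata does not match type %q", p.Name, p.MetadataType)
		}
		return fmt.Sprintf("%s: apk origin=%s", p.Name, m.OriginPackage)
	default:
		return fmt.Sprintf("%s: no known metadata", p.Name)
	}
}

func main() {
	p := Package{
		Name:         "musl-utils",
		MetadataType: ApkMetadataType,
		Metadata:     ApkMetadata{OriginPackage: "musl"},
	}
	fmt.Println(describe(p))
}
```

Keeping the two fields in agreement is the cataloger's job: a consumer that trusts `MetadataType` will assert `Metadata` to the matching type.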

Finally, here is an example of where the package construction is done in the apk cataloger. The first link is where `newPackage` is called in the `parseFunction`. The second link shows the package construction:
-- [Call for new package](https://github.com/anchore/syft/blob/6a7d6e6071829c7ce2943266c0e187b27c0b325c/syft/pkg/cataloger/apkdb/parse_apk_db.go#L96-L99)
If you have more questions about implementing a cataloger or questions about one you might be currently working
always feel free to file an issue or reach out to us [on slack](https://anchore.com/slack).
+
+#### Searching for files
+
+All catalogers are provided an instance of the [`source.FileResolver`](https://github.com/anchore/syft/blob/v0.70.0/syft/source/file_resolver.go#L8) to interface with the image and search for files. The implementations for these
+abstractions leverage [`stereoscope`](https://github.com/anchore/stereoscope) in order to perform searching. Here is a
+rough outline of how that works:
+
+1. A stereoscope `file.Index` is searched based on the input given (a path, glob, or MIME type). The index is relatively fast to search, but requires results to be filtered down to the files that exist in the specific layer(s) of interest. This is done automatically by the `filetree.Searcher` abstraction. This abstraction will fall back to searching directly against the raw `filetree.FileTree` if the index does not contain the file(s) of interest. Note: the `filetree.Searcher` is used by the `source.FileResolver` abstraction.
+2. Once the set of files is returned from the `filetree.Searcher`, the results are filtered down further to return only unique file results. For example, you may have requested files by a glob that returns multiple results. These results are deduplicated by real file, so if a result contains two references to the same file, say one accessed via symlink and one accessed via the real path, then the real path reference is returned and the symlink reference is filtered out. If both were accessed by symlink then the first (by lexical order) is returned. This is done automatically by the `source.FileResolver` abstraction.
+3. By the time results reach the `pkg.Cataloger` you are guaranteed to have a set of unique files that exist in the layer(s) of interest (relative to what the resolver supports).
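
The deduplication rule in step 2 can be sketched roughly as below. This is not the resolver's actual implementation: the `result` type and `dedupe` function are hypothetical, and a search hit is reduced to the path used to reach a file plus the real path it resolves to. The sketch prefers a hit reached by its real path and otherwise keeps the lexically first link.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a hypothetical search hit: the path used to reach a file and the real path behind it.
type result struct {
	AccessPath string
	RealPath   string
}

// viaLink reports whether the file was reached through a link rather than its real path.
func (r result) viaLink() bool { return r.AccessPath != r.RealPath }

// dedupe keeps one result per real file: the real-path reference when present,
// otherwise the first link reference in lexical order of access path.
func dedupe(results []result) []result {
	sorted := append([]result(nil), results...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].AccessPath < sorted[j].AccessPath })

	best := make(map[string]result)
	var order []string // real paths in the order they were first seen
	for _, r := range sorted {
		current, seen := best[r.RealPath]
		if !seen {
			best[r.RealPath] = r
			order = append(order, r.RealPath)
			continue
		}
		// Replace a link reference with a real-path reference to the same file.
		if current.viaLink() && !r.viaLink() {
			best[r.RealPath] = r
		}
	}

	out := make([]result, 0, len(order))
	for _, realPath := range order {
		out = append(out, best[realPath])
	}
	return out
}

func main() {
	hits := []result{
		{AccessPath: "/usr/bin/sh", RealPath: "/bin/busybox"},
		{AccessPath: "/bin/busybox", RealPath: "/bin/busybox"},
		{AccessPath: "/link-b", RealPath: "/data/file"},
		{AccessPath: "/link-a", RealPath: "/data/file"},
	}
	for _, r := range dedupe(hits) {
		fmt.Println(r.AccessPath, "->", r.RealPath)
	}
}
```

Sorting by access path first is what makes the "first by lexical order" rule fall out of a single pass over the hits.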
// since there is potentially considerable work for each symlink/hardlink that needs to be resolved, let's check to see if this is a symlink/hardlink first
-entry, err := r.img.FileCatalog.Get(ref)
-if err != nil {
-	return nil, fmt.Errorf("unable to fetch metadata (ref=%+v): %w", ref, err)
-}
-
-if entry.Metadata.TypeFlag == tar.TypeLink || entry.Metadata.TypeFlag == tar.TypeSymlink {
-	// a link may resolve in this layer or higher, assuming a squashed tree is used to search
-	// we should search all possible resolutions within the valid source
+entry, err := r.img.FileCatalog.Get(ref)
+if err != nil {
+	return nil, fmt.Errorf("unable to fetch metadata (ref=%+v): %w", ref, err)
+}
+
+if entry.Metadata.Type == file.TypeHardLink || entry.Metadata.Type == file.TypeSymLink {
+	// a link may resolve in this layer or higher, assuming a squashed tree is used to search
+	// we should search all possible resolutions within the valid source