License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/OASIcs.SLATE.2021.13
URN: urn:nbn:de:0030-drops-144305
URL: http://dagstuhl.sunsite.rwth-aachen.de/volltexte/2021/14430/


Hussain, Zafar ; Nurminen, Jukka K. ; Mikkonen, Tommi ; Kowiel, Marcin

Command Similarity Measurement Using NLP

pdf-format:
OASIcs-SLATE-2021-13.pdf (0.9 MB)


Abstract

Process invocations happen with almost every activity on a computer. To distinguish user input from potentially malicious activities, we need to better understand program invocations caused by commands. To achieve this, one must understand the commands' objectives, possible parameters, and valid syntax. In this work, we collected command data by scraping the commands' manual pages, including each command's description, syntax, and parameters. Then, we measured command similarity using two of these, description and parameters, based on the commands' natural language documentation. We compared the commands using the Term Frequency-Inverse Document Frequency (TFIDF) of their words and then measured cosine similarity to obtain the similarity of the commands' descriptions. For parameters, after measuring TFIDF and cosine similarity, the Hungarian method is applied to solve the assignment of different parameter combinations. Finally, commands are clustered based on their similarity scores. The results show that these methods have efficiently clustered the commands into smaller groups (commands with aliases or close counterparts) and into a bigger group (commands belonging to a larger set of related commands, e.g., bitsadmin for Windows and systemd for Linux). To validate the clustering results, we applied topic modeling to the command data, which confirms that 84% of the Windows commands and 98% of the Linux commands are clustered correctly.
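
The pipeline described in the abstract (TFIDF weighting, cosine similarity, then an assignment step over parameters) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: the command descriptions are invented for illustration, tokenization is plain whitespace splitting, the IDF uses a smoothed formula chosen here for convenience, and a brute-force search over permutations stands in for the Hungarian method. The brute-force search finds the same optimal assignment on small inputs, but it does not scale; a library routine such as `scipy.optimize.linear_sum_assignment` would be the usual choice in practice.

```python
import math
from collections import Counter
from itertools import permutations


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Weight = relative term frequency * smoothed IDF (the smoothing
    formula is an assumption of this sketch, not taken from the paper).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_assignment(sim):
    """Optimal one-to-one row -> column assignment maximizing total similarity.

    Brute force over permutations (a stand-in for the Hungarian method);
    assumes the matrix has at least as many columns as rows.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(sim[r][c] for r, c in enumerate(cols)),
    )
    return list(enumerate(best))


# Hypothetical command descriptions, invented for this example.
descriptions = {
    "copy": "copies one or more files from one location to another",
    "xcopy": "copies files and directories including subdirectories",
    "ping": "verifies network connectivity with a remote computer",
}
vecs = tfidf_vectors(list(descriptions.values()))
sim_copy_xcopy = cosine(vecs[0], vecs[1])  # shares "copies" and "files"
sim_copy_ping = cosine(vecs[0], vecs[2])   # no shared terms

# Hypothetical parameter-to-parameter similarity matrix for two commands.
# A greedy match would pair row 0 with column 0; the optimal assignment
# pairs row 0 with column 1 so that row 1 can take column 0.
param_sim = [[0.9, 0.8],
             [0.9, 0.1]]
pairs = best_assignment(param_sim)
```

In this toy data the two copy-like commands score higher than the unrelated pair, and the assignment step shows why a global optimum can differ from matching each parameter to its individually closest counterpart.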

BibTeX - Entry

@InProceedings{hussain_et_al:OASIcs.SLATE.2021.13,
  author =	{Hussain, Zafar and Nurminen, Jukka K. and Mikkonen, Tommi and Kowiel, Marcin},
  title =	{{Command Similarity Measurement Using NLP}},
  booktitle =	{10th Symposium on Languages, Applications and Technologies (SLATE 2021)},
  pages =	{13:1--13:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-202-0},
  ISSN =	{2190-6807},
  year =	{2021},
  volume =	{94},
  editor =	{Queir\'{o}s, Ricardo and Pinto, M\'{a}rio and Sim\~{o}es, Alberto and Portela, Filipe and Pereira, Maria Jo\~{a}o},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2021/14430},
  URN =		{urn:nbn:de:0030-drops-144305},
  doi =		{10.4230/OASIcs.SLATE.2021.13},
  annote =	{Keywords: Natural Language Processing, NLP, Windows Commands, Linux Commands, Textual Similarity, Command Term Frequency, Inverse Document Frequency, TFIDF, Cosine Similarity, Linear Sum Assignment, Command Clustering}
}

Keywords: Natural Language Processing, NLP, Windows Commands, Linux Commands, Textual Similarity, Command Term Frequency, Inverse Document Frequency, TFIDF, Cosine Similarity, Linear Sum Assignment, Command Clustering
Collection: 10th Symposium on Languages, Applications and Technologies (SLATE 2021)
Issue Date: 2021
Date of publication: 10.08.2021

