Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities

Perdices Burrero, Daniel; Ramos, Javier; García Dorado, José Luis; González, Iván; López de Vergara Méndez, Jorge Enrique

UAM_Biblioteca

dc.contributor.author	Perdices Burrero, Daniel
dc.contributor.author	Ramos, Javier
dc.contributor.author	García Dorado, José Luis
dc.contributor.author	González, Iván
dc.contributor.author	López de Vergara Méndez, Jorge Enrique
dc.contributor.other	UAM. Departamento de Tecnología Electrónica y de las Comunicaciones	es_ES
dc.date.accessioned	2022-03-11T11:44:19Z
dc.date.available	2022-03-11T11:44:19Z
dc.date.issued	2021-10-24
dc.identifier.citation	Computer Networks 198 (2021): 108357	en_US
dc.identifier.issn	1389-1286	es_ES
dc.identifier.uri	http://hdl.handle.net/10486/700706
dc.description.abstract	In an Internet arena where the revenues of search engines and other digital marketing firms peak, other actors still have open opportunities to monetize their users’ data. After appropriate anonymization, aggregation, and agreement, the set of websites that users visit may become exploitable data for ISPs. Uses range from assessing the scope of advertising campaigns to reinforcing user fidelity, among other marketing approaches, as well as security issues. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites are designed to preload and prefetch some content, in addition to embedding banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it hard to assess which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from even being sent. Given this limited input, we propose treating such domains as words and the sequences of domains as documents. This way, it is possible to identify the visited websites by translating the problem into a text classification context and applying the most promising techniques from the natural language processing and neural network fields. After applying different representation methods such as TF–IDF, Word2vec, Doc2vec, and custom neural networks in diverse scenarios and with several datasets, we can identify websites visited on purpose with accuracy figures over 90%, with peaks close to 100%, using processes that are fully automated and free of any human parametrization.	en_US
dc.description.sponsorship	This research has been partially funded by the Spanish State Research Agency under the project AgileMon (AEI PID2019-104451RBC21) and by the Spanish Ministry of Science, Innovation and Universities under the program for the training of university lecturers (Grant number: FPU19/05678)	en_US
dc.format.extent	14 pages	es_ES
dc.format.mimetype	application/pdf	es_ES
dc.language.iso	eng	en
dc.publisher	Elsevier	en_US
dc.relation.ispartof	Computer Networks	en_US
dc.rights	© 2021 The Authors	en_US
dc.subject.other	Deep learning	en_US
dc.subject.other	Internet monitoring	en_US
dc.subject.other	Natural language processing	en_US
dc.subject.other	Traffic monetization	en_US
dc.subject.other	Users analytics	en_US
dc.subject.other	Web browsing	en_US
dc.title	Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities	en_US
dc.type	article	en_US
dc.subject.eciencia	Telecommunications	es_ES
dc.relation.publisherversion	https://doi.org/10.1016/j.comnet.2021.108357	es_ES
dc.identifier.doi	10.1016/j.comnet.2021.108357	es_ES
dc.identifier.publicationfirstpage	108357-1	es_ES
dc.identifier.publicationlastpage	108357-14	es_ES
dc.identifier.publicationvolume	198	es_ES
dc.relation.projectID	Gobierno de España. PID2019-104451RBC21	es_ES
dc.type.version	info:eu-repo/semantics/publishedVersion	en
dc.rights.cc	Attribution – NonCommercial – NoDerivatives
dc.rights.accessRights	openAccess	es_ES
dc.authorUAM	Ramos De Santiago, Fco. Javier (261890)
dc.authorUAM	García Dorado, José Luis (261729)
dc.authorUAM	López De Vergara Méndez, Jorge Enrique (261085)
dc.facultadUAM	Escuela Politécnica Superior

Files in this item

Name: natural_perdices_comput.netw_2 ...
Size: 1.520Mb
Format: PDF

This item appears in the following Collection(s)

Producción científica en acceso abierto de la UAM [20434]

Related items

Web browsing privacy in the deep learning era: Beyond VPNs and encryption

FlexiTop: Sistema escalable y flexible de medidas de calidad para servicios Over-The-Top

FlexiTop: A flexible and scalable network monitoring system for Over-The-Top services
