dc.contributor.author | Perdices Burrero, Daniel | |
dc.contributor.author | Ramos, Javier | |
dc.contributor.author | García Dorado, José Luis | |
dc.contributor.author | González, Iván | |
dc.contributor.author | López de Vergara Méndez, Jorge Enrique | |
dc.contributor.other | UAM. Departamento de Tecnología Electrónica y de las Comunicaciones | es_ES |
dc.date.accessioned | 2022-03-11T11:44:19Z | |
dc.date.available | 2022-03-11T11:44:19Z | |
dc.date.issued | 2021-10-24 | |
dc.identifier.citation | Computer Networks 198 (2021): 108357 | en_US |
dc.identifier.issn | 1389-1286 | es_ES |
dc.identifier.uri | http://hdl.handle.net/10486/700706 | |
dc.description.abstract | In an Internet arena where the revenues of search engines and other digital marketing firms are peaking, other actors still have open opportunities to monetize their users’ data. After suitable anonymization, aggregation, and agreement, the set of websites that users visit may become exploitable data for ISPs. Uses range from assessing the reach of advertising campaigns to reinforcing user loyalty, among other marketing approaches, as well as addressing security issues. However, sniffers based on HTTP, DNS, TLS, or flow features do not suffice for this task. Modern websites are designed to preload and prefetch some content, in addition to embedding banners, social network links, images, and scripts from other websites. This self-triggered traffic makes it difficult to assess which websites users visited on purpose. Moreover, DNS caches prevent some queries for actively visited websites from even being sent. Given this limited input, we propose to treat such domains as words and sequences of domains as documents. This way, it is possible to identify the visited websites by translating the problem into a text classification context and applying the most promising techniques from the natural language processing and neural network fields. After applying different representation methods such as TF–IDF, Word2vec, Doc2vec, and custom neural networks in diverse scenarios and with several datasets, we can identify websites visited on purpose with accuracy figures over 90%, with peaks close to 100%, using processes that are fully automated and free of any human parametrization. | en_US |
dc.description.sponsorship | This research has been partially funded by the Spanish State Research Agency under the project AgileMon (AEI PID2019-104451RBC21) and by the Spanish Ministry of Science, Innovation and Universities under the program for the training of university lecturers (Grant number: FPU19/05678) | en_US |
dc.format.extent | 14 pages | es_ES |
dc.format.mimetype | application/pdf | es_ES |
dc.language.iso | eng | en |
dc.publisher | Elsevier | en_US |
dc.relation.ispartof | Computer Networks | en_US |
dc.rights | © 2021 The Authors | en_US |
dc.subject.other | Deep learning | en_US |
dc.subject.other | Internet monitoring | en_US |
dc.subject.other | Natural language processing | en_US |
dc.subject.other | Traffic monetization | en_US |
dc.subject.other | Users analytics | en_US |
dc.subject.other | Web browsing | en_US |
dc.title | Natural language processing for web browsing analytics: Challenges, lessons learned, and opportunities | en_US |
dc.type | article | en_US |
dc.subject.eciencia | Telecommunications | es_ES |
dc.relation.publisherversion | https://doi.org/10.1016/j.comnet.2021.108357 | es_ES |
dc.identifier.doi | 10.1016/j.comnet.2021.108357 | es_ES |
dc.identifier.publicationfirstpage | 108357-1 | es_ES |
dc.identifier.publicationlastpage | 108357-14 | es_ES |
dc.identifier.publicationvolume | 198 | es_ES |
dc.relation.projectID | Gobierno de España. PID2019-104451RBC21 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | en |
dc.rights.cc | Attribution – NonCommercial – NoDerivatives | |
dc.rights.accessRights | openAccess | es_ES |
dc.authorUAM | Ramos De Santiago, Fco. Javier (261890) | |
dc.authorUAM | García Dorado, José Luis (261729) | |
dc.authorUAM | López De Vergara Méndez, Jorge Enrique (261085) | |
dc.facultadUAM | Escuela Politécnica Superior | |