Data collection and anonymization pipeline

This figure shows our data collection pipeline and the network topology. Shared graduate student apartments have one Ethernet port per bedroom, while other apartments have one Ethernet port per apartment. The ports connect to a switch in the residential building, which connects to an aggregation switch and then to the Internet via the campus network and a few providers. The aggregation switch mirrors traffic both to and from our residential buildings over 2×10 Gbps dedicated fiber to a server in our nearby lab. Since Columbia has not deployed IPv6 in these buildings, we study only IPv4 traffic.
Our data collection and anonymization pipeline follows established practices and was approved by Columbia’s IT. Our Institutional Review Board (IRB) formally reviewed the study and declared it exempt because it is not human-subjects research. The pipeline anonymizes privacy-sensitive fields and discards personally identifiable information. We do not identify any individual, and we do not study network usage below the level of buildings.


Mapping flows to services

We associate each flow with a service (e.g., Netflix, YouTube) using a combination of domain keyword matching, unsupervised clustering, and transport-layer heuristics.

Keyword-Based Mapping:

We match ⟨DNS, SNI⟩ domain pairs against a curated list of ~200 service-related keywords (e.g., domains containing "nflx" are mapped to Netflix). This rule-based mapping, built upon the public nDPI keyword set, accounts for 73% of traffic by volume. The list of keywords is provided below:


Keyword                   Service
nflx                      netflix
hbomax                    hbomax
apple-dns                 icloud
icloud                    icloud
tv.apple                  appletv
itunes                    applestore
aapl                      icloud
instagram                 instagram
blizzard                  blizzard
hulu                      hulu
icloud                    icloud
steam                     steam
outlook                   microsoftcloud
twimg                     twitter
googlevideo               youtube
tiktok                    tiktok
cdn-apple                 icloud
espn                      espn
movetv                    movetv
redd.it                   reddit
spotify                   spotify
zoom                      zoom
slack                     slack
peacock                   peacock
gmail                     gmail
ytimg                     youtube
fitbit                    fitbit
stripe                    stripe
bestbuy                   bestbuy
roblox                    roblox
youtube                   youtube
ookla                     ookla
sling.com                 sling
cdn-apple                 applecdn
apple                     icloud
fbcdn                     facebook
facetime.apple            facetime
messenger.com             facebookmessenger
ttvnw                     twitch
photosdata-pa.googleapis  googlephotos
uploadgig                 uploadgig
idrive                    idrive
wireguard                 wireguardvpn
megaphone                 spotify
cbsivideo                 cbsvideo
pbs.org                   pbs
dropbox                   dropbox
hbo                       hbomax
roku                      roku
warner                    warner
spectrum                  spectrum
xbox                      xbox
cbsaavideo                cbsvideo
msggo                     msggo
nvidiagrid                nvidiagrid
oneclient                 microsoftcloud
skype                     msteams
youtube                   youtube
plutotv                   plutotv
pluto.tv                  plutotv
reddit                    reddit
facebook                  facebook
taobao                    taobao
shopify                   shopify
github                    github
grammarly                 grammarly
whatsapp                  whatsapp
nyt                       newyorktimes
tidal                     tidal
twitchcdn                 twitch
twitter                   twitter
dssott                    disneyplus
microsoft                 microsoft
teams.microsoft           msteams
adobe                     adobe
bilivideo                 bilivideo
line-scdn                 line
scdn                      spotify
telegram                  telegram
qooqlevideo               qooqlevideo
wattpad                   wattpad
riotcdn                   riotcdn
cbsnews                   cbsnews
pandora                   pandora
siriusxm                  pandora
torproject.org            tor
echo                      amazonecho
office                    microsoftcloud
stitcher                  stitcher
fbsbx                     facebook
redgifs                   reddit
playstation               playstation
wikimedia                 wikipedia
metmuseum                 metmuseum
courseworks               courseworks
wordpress                 wordpress
discord                   discord
zillow                    zillow
windows                   microsoft
onlyfans                  onlyfans
tumblr                    tumblr
xvideos                   xvideos
llnwd                     limelight
xhcdn                     xhcdn
igcdn                     instagram
wechat                    wechat
phncdn                    phcdn
bumble.com                bumble
edgesuite                 microsoftcloud
tinder                    tinder
pv-cdn                    primevideo
gaijin                    gaijin
wetransfer                wetransfer
eporner                   eporner
wsj                       wallstreetjournal
mushroomtrack             mushroomtrack
comcast                   comcast
epicgames                 epicgames
pinterest                 pinterest
linkedin                  linkedin
pornez                    pornez
squarespace               squarespace
paramountplus             paramountplus
mmcdn                     mmcdn
gopuff                    gopuff
photos.google             googlephotos
messages.google           googlemessages
video.google              youtube
groups.google             googlegroups
play.google               googleplay
drive.google              googledrive
calendar.google           googlecalendar
spreadsheets.google       googledrive
chat.google               googlemessages
googleusercontent         googleusercontent
webex                     webex
cisco                     cisco
foxitsoftware             foxitsoftware
campusgroups              campusgroups
conda                     conda
columbia                  columbia
eerospeedtests            eerospeedtests
dailymotion               dailymotion
samsung                   samsung
notion                    notion
sndcdn                    soundcloud
soundcloud                soundcloud
groupme                   groupme
mail.google               gmail
docs.google               gdocs
nintendo                  nintendo
ubisoft                   ubisoft
showtime                  showtime
crunchyroll               crunchyroll
baddiehub                 baddiehub
dssedge                   disneyplus
sc-cdn                    snapchat
onedrive                  microsoftcloud
live-video                twitch
aiv-cdn                   primevideo
olemovienews              olemovienews
afreecatv                 afreecatv
cdn-videos.lpsg           lpsg
v.vrv                     crunchyroll
fubo                      fubo
bittorrent                bittorrent
publicbt                  bittorrent
android.googleapis        playstore
torrents                  bittorrent
watchliveformula1         watchliveformula1
storage.live              microsoftcloud
ott-video-cf.formula1     ott-video-cf.formula1
kakao                     kakaotalk
cwtv                      cartoonnetwork
vimeo                     vimeo
disney                    disneyplus
licdn                     linkedin
kanopy                    kanopy
bdsmlr                    bdsmlr
sharepoint                microsoftcloud
tubi                      tubi
n.shifen                  baidu
primevideo                primevideo
amazonvideo               primevideo
video.a2z                 primevideo
clients.google.com        playstore
clients6.google.com       playstore
snapchat                  snapchat
inbox.google.com          gmail
meet.google               googlemeet
netflix                   netflix
hoyoverse                 hoyoverse
hinge                     hinge
theleague                 theleague
porntn                    porntn
xnxx-cdn                  xnxx-cdn
patreon                   patreon
mcafee                    mcafee
cloudfront                primevideo
max.com                   hbomax
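
The keyword lookup can be sketched in a few lines of Python. This is an illustrative sketch, not the pipeline's actual code: the function name, the excerpted dictionary, and the example hostnames are hypothetical. It also makes two assumptions the text does not state: a keyword may match either the DNS name or the SNI, and when several keywords match, the longest one wins (so that "facetime.apple" takes precedence over "apple").

```python
from typing import Optional

# Small excerpt of the keyword table above (hypothetical subset for illustration).
KEYWORD_TO_SERVICE = {
    "nflx": "netflix",
    "netflix": "netflix",
    "googlevideo": "youtube",
    "ytimg": "youtube",
    "apple": "icloud",
    "tv.apple": "appletv",
    "facetime.apple": "facetime",
    "ttvnw": "twitch",
}


def map_pair_to_service(dns_name: Optional[str], sni: Optional[str]) -> Optional[str]:
    """Return the service for a <DNS, SNI> pair, or None if no keyword matches.

    Assumption: substring match on either field, longest keyword wins.
    """
    best = None  # (keyword, service) of the longest match seen so far
    for name in (dns_name, sni):
        if not name:
            continue
        lowered = name.lower()
        for keyword, service in KEYWORD_TO_SERVICE.items():
            if keyword in lowered and (best is None or len(keyword) > len(best[0])):
                best = (keyword, service)
    return best[1] if best else None
```

For example, under these assumptions `map_pair_to_service("ipv4-c001.nflxvideo.net", None)` yields `"netflix"`, while a pair that matches no keyword yields `None` and is passed on to the later stages.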

Unsupervised Domain Clustering:

For domain pairs not covered by keywords, we use an unsupervised learning approach to identify clusters of related ⟨DNS, SNI⟩ pairs based on temporal correlation, that is, how often they appear near each other in time. We apply the Louvain clustering algorithm and then label each cluster by its dominant service: if a single service makes up at least 60% of a cluster's traffic, we assign the entire cluster to that service. This adds another 6.4% of traffic to our service mapping.
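
The two steps around the Louvain algorithm can be sketched as follows. This is a hedged illustration rather than the actual implementation: the function names, the 5-second co-occurrence window, and the choice to measure a service's share against the cluster's total traffic (including not-yet-labeled pairs) are all assumptions. The community detection itself is left to a graph library and is not shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Pair = Tuple[str, str]  # (DNS name, SNI)


def cooccurrence_weights(events: Iterable[Tuple[float, Pair]],
                         window_s: float = 5.0) -> Dict[Tuple[Pair, Pair], int]:
    """Count how often two pairs are seen within window_s seconds of each other.

    The result can serve as edge weights of the graph handed to Louvain.
    The window length is a hypothetical parameter.
    """
    ordered = sorted(events, key=lambda e: e[0])
    weights: Dict[Tuple[Pair, Pair], int] = defaultdict(int)
    start = 0
    for i, (t, pair) in enumerate(ordered):
        # Advance the window start past events older than window_s.
        while ordered[start][0] < t - window_s:
            start += 1
        for _, other in ordered[start:i]:
            if other != pair:
                weights[tuple(sorted((pair, other)))] += 1
    return dict(weights)


def label_cluster(cluster: Set[Pair],
                  bytes_by_pair: Dict[Pair, int],
                  known_service: Dict[Pair, str],
                  threshold: float = 0.60) -> Optional[str]:
    """Return the dominant service if it carries >= threshold of cluster traffic."""
    total = sum(bytes_by_pair.get(p, 0) for p in cluster)
    if total == 0:
        return None
    per_service: Dict[str, int] = defaultdict(int)
    for p in cluster:
        service = known_service.get(p)  # label from the keyword stage, if any
        if service is not None:
            per_service[service] += bytes_by_pair.get(p, 0)
    if not per_service:
        return None
    service, volume = max(per_service.items(), key=lambda kv: kv[1])
    return service if volume / total >= threshold else None
```

A cluster whose dominant service falls below the threshold stays unlabeled in this sketch.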

Transport-Level Heuristics:

For flows with no DNS or SNI data, we apply manual rules based on known transport-layer signatures (e.g., destination AS + port + protocol). We classify traffic from Apple's ASN (714) on ports 16393–16402 or 3478–3497 as FaceTime. We identify WireGuard VPN by port 51820, Microsoft Teams by port 3480 traffic from Microsoft's ASN (8075), Google Meet by port 3478 traffic from Google's ASN (15169), and Facebook Messenger by port 3478 traffic from ASN 32934. We label traffic from Twitch (ASN 46489), Ubisoft (ASN 49544), and PlayStation (ASN 33353) by ASN. Finally, we infer BitTorrent traffic from activity on ports 6881–6889.
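
These rules amount to a small decision list, sketched below in Python. The ASNs and ports come from the text; the function and parameter names, the service labels (borrowed from the keyword table), and the rule order are assumptions made for illustration. The sketch also assumes that the ASN 714 condition applies to both FaceTime port ranges, and it omits the transport protocol because the text does not give it per rule.

```python
from typing import Optional

# Port ranges from the text (inclusive bounds).
FACETIME_PORTS = set(range(16393, 16403)) | set(range(3478, 3498))
BITTORRENT_PORTS = set(range(6881, 6890))

# Services identified by ASN without a port condition.
ASN_ONLY = {46489: "twitch", 49544: "ubisoft", 33353: "playstation"}


def classify_by_transport(remote_asn: int, remote_port: int) -> Optional[str]:
    """Label a flow that has no DNS or SNI data, or return None if no rule fires.

    Assumption: ASN-specific rules are checked before port-only rules.
    """
    if remote_asn == 714 and remote_port in FACETIME_PORTS:
        return "facetime"
    if remote_asn == 8075 and remote_port == 3480:
        return "msteams"
    if remote_asn == 15169 and remote_port == 3478:
        return "googlemeet"
    if remote_asn == 32934 and remote_port == 3478:
        return "facebookmessenger"
    if remote_asn in ASN_ONLY:
        return ASN_ONLY[remote_asn]
    if remote_port == 51820:
        return "wireguardvpn"
    if remote_port in BITTORRENT_PORTS:
        return "bittorrent"
    return None
```

Checking ASN-specific rules first keeps a port shared by several services, such as 3478, from being attributed to the wrong one.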