# Data Anonymization Techniques for Geospatial Data
This notebook demonstrates various techniques for anonymizing a dataset containing place names, latitude, longitude, and wealth values. Each technique has its own trade-offs between data privacy and utility, and the choice of method depends on the requirements for anonymization and analysis.
## 1. Aggregation/Generalization
Aggregation or generalization involves reducing the precision of geographic coordinates or grouping data into broader categories to make identification more difficult.
For example, latitude and longitude values can be rounded to reduce precision, making it harder to identify exact locations.
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
    'wealth': [100000, 150000, 200000]
})

# Generalize coordinates by rounding to one decimal place
df['latitude_generalized'] = df['latitude'].round(1)
df['longitude_generalized'] = df['longitude'].round(1)

print("Data after generalization:")
df
```
Data after generalization:

|   | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized |
|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 |
| 2 | C | 40.7128 | -74.0060 | 200000 | 40.7 | -74.0 |
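Rounding to one decimal place snaps points to a grid of roughly 11 km in latitude. The same idea extends to grids of arbitrary cell size. A minimal sketch, where the 0.25-degree cell size is an arbitrary choice for illustration:

```python
import numpy as np

# Snap each coordinate to the centre of its grid cell; every point
# inside a cell maps to the same published location.
cell_size = 0.25  # illustrative value
df['latitude_grid'] = (np.floor(df['latitude'] / cell_size) + 0.5) * cell_size
df['longitude_grid'] = (np.floor(df['longitude'] / cell_size) + 0.5) * cell_size
```

Larger cells give stronger anonymity at the cost of spatial resolution.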
## 2. Adding Spatial Noise
Adding spatial noise involves randomly perturbing the latitude and longitude values. This can help anonymize the data while preserving overall geographic trends.
```python
import random

# Function to add uniform random noise to latitude and longitude
def add_spatial_noise(lat, lon, noise_level=0.01):
    noisy_lat = lat + random.uniform(-noise_level, noise_level)
    noisy_lon = lon + random.uniform(-noise_level, noise_level)
    return noisy_lat, noisy_lon

# Apply the spatial noise function row by row
df['latitude_noisy'], df['longitude_noisy'] = zip(
    *df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1)
)

print("Data after adding spatial noise:")
df
```
Data after adding spatial noise:

|   | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized | latitude_noisy | longitude_noisy |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 | 34.056263 | -118.240379 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 | 37.765504 | -122.412138 |
| 2 | C | 40.7128 | -74.0060 | 200000 | 40.7 | -74.0 | 40.709770 | -74.011202 |
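A noise_level of 0.01 degrees is roughly 1 km in latitude, but a degree of longitude shrinks with distance from the equator, so the same setting perturbs east-west positions less at high latitudes. If the privacy requirement is stated in metres, the offset can be drawn in metres and converted to degrees per point. A rough sketch using the common approximation of 111,320 m per degree of latitude (the function name and 500 m radius are illustrative):

```python
import math
import random

def add_noise_meters(lat, lon, radius_m=500):
    # Draw a random displacement of up to radius_m metres in a random
    # direction, then convert it to degrees at this latitude.
    dist = random.uniform(0, radius_m)
    bearing = random.uniform(0, 2 * math.pi)
    dlat = (dist * math.cos(bearing)) / 111_320
    dlon = (dist * math.sin(bearing)) / (111_320 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

df['latitude_noisy_m'], df['longitude_noisy_m'] = zip(
    *df.apply(lambda row: add_noise_meters(row['latitude'], row['longitude']), axis=1)
)
```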
## 3. K-Anonymity for Spatial Data
K-anonymity ensures that each data point is indistinguishable from at least k-1 other data points. For spatial data, this can be achieved by clustering nearby points and using the cluster centroids as the anonymized locations.
```python
# pip install scikit-mobility
import pandas as pd
from skmob import TrajDataFrame
from skmob.preprocessing import clustering

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.0522, 34.0523, 34.0524, 34.0525],
    'longitude': [-118.2437, -118.2438, -118.2439, -118.2440],
    'datetime': pd.date_range(start='2024-10-01', periods=4, freq='H')  # sample datetime values
}
df = pd.DataFrame(data)

# Convert the DataFrame to a TrajDataFrame for clustering
tdf = TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='datetime')

# Apply clustering with a 1 km radius and at least 2 points per cluster
clustered_tdf = clustering.cluster(tdf, cluster_radius_km=1, min_samples=2)

print("Data after applying k-anonymity clustering:")
print(clustered_tdf)
```
```
Data after applying k-anonymity clustering:
       lat       lng            datetime  cluster
0  34.0522 -118.2437 2024-10-01 00:00:00        0
1  34.0523 -118.2438 2024-10-01 01:00:00        0
2  34.0524 -118.2439 2024-10-01 02:00:00        0
3  34.0525 -118.2440 2024-10-01 03:00:00        0
```
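The call above only labels each point with a cluster id; the coordinates themselves are unchanged. To complete the anonymization, each point's coordinates can be replaced with its cluster centroid, so that all records in a cluster share one published location. A minimal sketch, assuming the clustered_tdf from the previous cell (skmob names the coordinate columns lat and lng, as in the output):

```python
# Replace each point's coordinates with the mean of its cluster, so
# records in the same cluster become spatially indistinguishable.
centroids = clustered_tdf.groupby('cluster')[['lat', 'lng']].transform('mean')
clustered_tdf['lat_anonymized'] = centroids['lat']
clustered_tdf['lng_anonymized'] = centroids['lng']
```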
## 4. Binning Wealth Values
To anonymize sensitive numerical data, such as wealth, binning can be used. This involves converting continuous values into discrete categories.
```python
import pandas as pd

# Sample data creation (replace this with your actual data)
data = {
    'wealth': [50000, 120000, 180000, 75000, 210000, 130000]
}
df = pd.DataFrame(data)

# Bin the wealth column into discrete ranges; the open-ended top edge
# keeps values above the last threshold from silently becoming NaN
df['wealth_binned'] = pd.cut(df['wealth'],
                             bins=[0, 100000, 150000, float('inf')],
                             labels=['Low', 'Medium', 'High'])

print("Data after binning wealth values:")
print(df)
```
```
Data after binning wealth values:
   wealth wealth_binned
0   50000           Low
1  120000        Medium
2  180000          High
3   75000           Low
4  210000          High
5  130000        Medium
```
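Fixed bin edges require knowing the range of the data in advance. An alternative is quantile binning with pd.qcut, which derives the edges from the data and places roughly equal numbers of records in each category, which also supports k-anonymity over the wealth attribute. A minimal sketch on the same column:

```python
# Quantile binning: edges are chosen so that each category holds
# roughly the same number of records.
df['wealth_quantile'] = pd.qcut(df['wealth'], q=3, labels=['Low', 'Medium', 'High'])
```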
## 5. Data Masking for Place Names
Data masking replaces identifiable values, such as place names, with broader categories or randomly generated names to protect privacy.
```python
# pip install faker
from faker import Faker

fake = Faker()

# Replace place names with synthetic city names
df['place_synthetic'] = [fake.city() for _ in range(len(df))]

print("Data after masking place names:")
df
```
Data after masking place names:

|   | wealth | wealth_binned | place_synthetic |
|---|---|---|---|
| 0 | 50000 | Low | Mitchellbury |
| 1 | 120000 | Medium | South Jason |
| 2 | 180000 | High | Perkinschester |
| 3 | 75000 | Low | East Steven |
| 4 | 210000 | High | Monicastad |
| 5 | 130000 | Medium | Andrewberg |
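Note that fake.city() draws a fresh name on every call, so two records for the same real place would receive different pseudonyms. When repeated places must stay linkable after masking, a fixed mapping can be built first. A minimal sketch with a hypothetical places column, since the sample frame here has none:

```python
# Hypothetical place names; in practice this would be the real column.
places = pd.Series(['A', 'B', 'A', 'C', 'B', 'A'])

# One synthetic name per distinct place, so repeats stay consistent.
mapping = {place: fake.city() for place in places.unique()}
places_masked = places.map(mapping)
```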
## 6. Using Synthetic Data Generation
Synthetic data generation involves creating new data points that mimic the statistical properties of the original dataset. This can help preserve the privacy of individuals while maintaining data utility for analysis.
```python
import pandas as pd
import numpy as np

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Generate synthetic values by sampling normal distributions fitted to
# each column's mean and standard deviation
df['latitude_synthetic'] = np.random.normal(df['latitude'].mean(), df['latitude'].std(), size=len(df))
df['longitude_synthetic'] = np.random.normal(df['longitude'].mean(), df['longitude'].std(), size=len(df))
df['wealth_synthetic'] = np.random.normal(df['wealth'].mean(), df['wealth'].std(), size=len(df))

print("Data after generating synthetic values:")
print(df)
```
```
Data after generating synthetic values:
   latitude  longitude  wealth  latitude_synthetic  longitude_synthetic  \
0     34.05    -118.25   50000           34.697749           -80.257765
1     36.16    -115.15  120000           40.586576          -122.255409
2     40.71     -74.01  180000           35.411037          -108.299022
3     37.77    -122.42   75000           35.861865          -122.470457
4     34.05    -118.25  210000           35.498666          -127.272600

   wealth_synthetic
0     120537.264473
1     172639.157841
2     166803.227519
3     102927.222418
4     255479.085813
```
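Sampling each column independently, as above, preserves the column means and standard deviations but destroys any correlation between location and wealth. If those relationships matter for analysis, the columns can instead be sampled jointly from a multivariate normal distribution fitted to the original data. A minimal sketch:

```python
# Fit a multivariate normal to the original columns and sample from it,
# preserving the covariance between location and wealth.
cols = ['latitude', 'longitude', 'wealth']
synthetic = np.random.multivariate_normal(
    mean=df[cols].mean(), cov=df[cols].cov(), size=len(df)
)
df_synthetic = pd.DataFrame(synthetic, columns=cols)
```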
## 7. Combining Techniques
Combining multiple anonymization techniques can provide a higher level of privacy protection. For example, spatial noise can be added along with binning of wealth values.
```python
import pandas as pd
import numpy as np

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Function to add Gaussian spatial noise
def add_spatial_noise(latitude, longitude, noise_level=0.01):
    noisy_latitude = latitude + np.random.normal(0, noise_level)
    noisy_longitude = longitude + np.random.normal(0, noise_level)
    return noisy_latitude, noisy_longitude

# Add spatial noise and bin wealth values (open-ended top bin, as in section 4)
df['latitude_combined'], df['longitude_combined'] = zip(
    *df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1)
)
df['wealth_combined'] = pd.cut(df['wealth'],
                               bins=[0, 100000, 150000, float('inf')],
                               labels=['Low', 'Medium', 'High'])

print("Data after combining anonymization techniques:")
print(df)
```
```
Data after combining anonymization techniques:
   latitude  longitude  wealth  latitude_combined  longitude_combined  \
0     34.05    -118.25   50000          34.071089         -118.253393
1     36.16    -115.15  120000          36.166661         -115.160829
2     40.71     -74.01  180000          40.720372          -73.994760
3     37.77    -122.42   75000          37.772173         -122.411337
4     34.05    -118.25  210000          34.064031         -118.240093

  wealth_combined
0             Low
1          Medium
2            High
3             Low
4            High
```
## 8. Data Swapping
Data swapping involves exchanging values of sensitive attributes (e.g., wealth) between different records. This technique helps anonymize the data while preserving the overall statistical distribution.
For example, we can randomly swap the wealth values between different locations in the dataset.
```python
import numpy as np

# Function to shuffle all values of a column across records
def swap_values(column):
    shuffled = column.sample(frac=1).values
    return shuffled

# Apply swapping to the 'wealth' column
df['wealth_swapped'] = swap_values(df['wealth'])

print("Data after swapping wealth values:")
df
```
Data after swapping wealth values:

|   | latitude | longitude | wealth | latitude_combined | longitude_combined | wealth_combined | wealth_swapped |
|---|---|---|---|---|---|---|---|
| 0 | 34.05 | -118.25 | 50000 | 34.071089 | -118.253393 | Low | 120000 |
| 1 | 36.16 | -115.15 | 120000 | 36.166661 | -115.160829 | Medium | 75000 |
| 2 | 40.71 | -74.01 | 180000 | 40.720372 | -73.994760 | High | 180000 |
| 3 | 37.77 | -122.42 | 75000 | 37.772173 | -122.411337 | Low | 210000 |
| 4 | 34.05 | -118.25 | 210000 | 34.064031 | -118.240093 | High | 50000 |
Data swapping can help protect sensitive information, but excessive swapping might distort the relationships between attributes. The amount of swapping should be chosen carefully to balance privacy and data utility.
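One way to control that balance is to swap only a fraction of the records and leave the rest untouched. A minimal sketch, where the 40% swap fraction is an arbitrary choice for illustration:

```python
def partial_swap(column, swap_fraction=0.4, seed=None):
    # Shuffle values only within a randomly chosen subset of rows.
    rng = np.random.default_rng(seed)
    result = column.copy()
    n_swap = int(len(column) * swap_fraction)
    idx = rng.choice(column.index, size=n_swap, replace=False)
    result.loc[idx] = rng.permutation(result.loc[idx].values)
    return result

df['wealth_partially_swapped'] = partial_swap(df['wealth'])
```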