Data Anonymization Techniques for Geospatial Data#

This notebook demonstrates various techniques for anonymizing a dataset containing place names, latitude, longitude, and wealth values. Each technique has its own trade-offs between data privacy and utility, and the choice of method depends on the requirements for anonymization and analysis.

1. Aggregation/Generalization#

Aggregation or generalization involves reducing the precision of geographic coordinates or grouping data into broader categories to make identification more difficult.

For example, latitude and longitude values can be rounded to reduce precision, making it harder to identify exact locations.

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
    'wealth': [100000, 150000, 200000]
})

# Generalizing coordinates by rounding to one decimal place
df['latitude_generalized'] = df['latitude'].round(1)
df['longitude_generalized'] = df['longitude'].round(1)

print("Data after generalization:")
df
Data after generalization:
  place  latitude  longitude  wealth  latitude_generalized  longitude_generalized
0     A   34.0522  -118.2437  100000                  34.1                 -118.2
1     B   37.7749  -122.4194  150000                  37.8                 -122.4
2     C   40.7128   -74.0060  200000                  40.7                  -74.0
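Rounding to a fixed number of decimals is one special case of snapping to a grid. As a variant sketch (the 0.05° cell size is an arbitrary illustrative choice, roughly 5.5 km of latitude), coordinates can be floored to the corner of a grid cell of any size:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
})

# Snap each coordinate to the lower-left corner of its grid cell.
cell_size = 0.05  # cell size in degrees; ~5.5 km of latitude (illustrative choice)
df['latitude_grid'] = np.floor(df['latitude'] / cell_size) * cell_size
df['longitude_grid'] = np.floor(df['longitude'] / cell_size) * cell_size
```

Unlike rounding, grid snapping lets the cell size be tuned directly to the level of location privacy required.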

2. Adding Spatial Noise#

Adding spatial noise involves randomly perturbing the latitude and longitude values. This can help anonymize the data while preserving overall geographic trends.

import random

# Function to add noise to latitude and longitude
def add_spatial_noise(lat, lon, noise_level=0.01):
    noisy_lat = lat + random.uniform(-noise_level, noise_level)
    noisy_lon = lon + random.uniform(-noise_level, noise_level)
    return noisy_lat, noisy_lon

# Applying the spatial noise function
df['latitude_noisy'], df['longitude_noisy'] = zip(*df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1))

print("Data after adding spatial noise:")
df
Data after adding spatial noise:
  place  latitude  longitude  wealth  latitude_generalized  longitude_generalized  latitude_noisy  longitude_noisy
0     A   34.0522  -118.2437  100000                  34.1                 -118.2       34.056263      -118.240379
1     B   37.7749  -122.4194  150000                  37.8                 -122.4       37.765504      -122.412138
2     C   40.7128   -74.0060  200000                  40.7                  -74.0       40.709770       -74.011202
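A noise level expressed in degrees is hard to reason about, because a degree of longitude shrinks with latitude. As a sketch (assuming the common spherical-earth approximation of ~111,320 m per degree of latitude; `add_noise_meters` and `max_offset_m` are hypothetical names), the perturbation can instead be specified in metres:

```python
import math
import random
import pandas as pd

df = pd.DataFrame({
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
})

def add_noise_meters(lat, lon, max_offset_m=500):
    # ~111,320 m per degree of latitude; a degree of longitude
    # shrinks by cos(latitude) away from the equator.
    lat_offset = random.uniform(-max_offset_m, max_offset_m) / 111_320
    lon_offset = random.uniform(-max_offset_m, max_offset_m) / (
        111_320 * math.cos(math.radians(lat)))
    return lat + lat_offset, lon + lon_offset

df['latitude_noisy'], df['longitude_noisy'] = zip(
    *df.apply(lambda r: add_noise_meters(r['latitude'], r['longitude']), axis=1))
```

This makes the privacy guarantee tangible: every point moves by at most `max_offset_m` metres along each axis.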

3. K-Anonymity for Spatial Data#

K-anonymity ensures that each record is indistinguishable from at least k-1 other records. For spatial data it can be approximated by clustering nearby points, requiring each cluster to contain at least k members, and publishing the cluster centroids in place of the exact locations.

# pip install scikit-mobility
import pandas as pd
from skmob import TrajDataFrame
from skmob.preprocessing import clustering

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.0522, 34.0523, 34.0524, 34.0525],
    'longitude': [-118.2437, -118.2438, -118.2439, -118.2440],
    'datetime': pd.date_range(start='2024-10-01', periods=4, freq='h')  # sample hourly timestamps
}
df = pd.DataFrame(data)

# Convert the DataFrame to a TrajDataFrame for clustering
tdf = TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='datetime')

# Cluster points within a 1 km radius; min_samples=2 requires at least two points per cluster (k = 2)
clustered_tdf = clustering.cluster(tdf, cluster_radius_km=1, min_samples=2)

print("Data after applying k-anonymity clustering:")
print(clustered_tdf)
Data after applying k-anonymity clustering:
       lat       lng            datetime  cluster
0  34.0522 -118.2437 2024-10-01 00:00:00        0
1  34.0523 -118.2438 2024-10-01 01:00:00        0
2  34.0524 -118.2439 2024-10-01 02:00:00        0
3  34.0525 -118.2440 2024-10-01 03:00:00        0
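The clustering step labels each point but does not yet replace any coordinates. To complete the scheme described above, each point can be mapped to its cluster centroid; a minimal sketch in plain pandas, with the clustered output reproduced as an ordinary DataFrame:

```python
import pandas as pd

# Clustered points (mirroring the clustering output above).
df = pd.DataFrame({
    'lat': [34.0522, 34.0523, 34.0524, 34.0525],
    'lng': [-118.2437, -118.2438, -118.2439, -118.2440],
    'cluster': [0, 0, 0, 0],
})

# Replace each point with its cluster centroid so that every member
# of a cluster reports the same anonymized location.
df['lat_anon'] = df.groupby('cluster')['lat'].transform('mean')
df['lng_anon'] = df.groupby('cluster')['lng'].transform('mean')
```

For true k-anonymity, any cluster with fewer than k members should be merged into a neighbouring cluster or suppressed before the data are published.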

4. Binning Wealth Values#

To anonymize sensitive numerical data, such as wealth, binning can be used. This involves converting continuous values into discrete categories.

import pandas as pd

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'wealth': [50000, 120000, 180000, 75000, 210000, 130000]
}
df = pd.DataFrame(data)

# Binning the wealth column into discrete ranges.
# Note: values above the last edge (e.g. 210000) fall outside the bins and become NaN;
# extend the final edge (or use np.inf) if every value must be binned.
df['wealth_binned'] = pd.cut(df['wealth'], bins=[0, 100000, 150000, 200000], labels=['Low', 'Medium', 'High'])

print("Data after binning wealth values:")
print(df)
Data after binning wealth values:
   wealth wealth_binned
0   50000           Low
1  120000        Medium
2  180000          High
3   75000           Low
4  210000           NaN
5  130000        Medium
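With fixed edges, values outside the outermost bin (here 210000) come out as NaN. A sketch using quantile binning via `pd.qcut` instead, which takes its edges from the data itself, so every value is binned and the groups have roughly equal counts:

```python
import pandas as pd

df = pd.DataFrame({'wealth': [50000, 120000, 180000, 75000, 210000, 130000]})

# Quantile binning: edges come from the data, so no value is left
# out and each bin holds (roughly) the same number of records.
df['wealth_binned'] = pd.qcut(df['wealth'], q=3, labels=['Low', 'Medium', 'High'])
```

Equal-sized groups are also attractive for anonymization, since no bin ends up with only a handful of easily re-identified records.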

5. Data Masking for Place Names#

Data masking replaces identifiable values, such as place names, with broader categories or randomly generated names to protect privacy.

# pip install faker
from faker import Faker

fake = Faker()
# Replace place names with synthetic city names
df['place_synthetic'] = [fake.city() for _ in range(len(df))]

print("Data after masking place names:")
df
Data after masking place names:
   wealth wealth_binned place_synthetic
0   50000           Low    Mitchellbury
1  120000        Medium     South Jason
2  180000          High  Perkinschester
3   75000           Low     East Steven
4  210000           NaN      Monicastad
5  130000        Medium      Andrewberg
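Drawing a fresh `fake.city()` for every row gives different pseudonyms to repeated occurrences of the same place, which breaks any grouping by location. When consistency matters, each distinct name can be mapped once to a stable pseudonym; a minimal sketch with opaque labels (the place names are hypothetical):

```python
# Map each distinct real place name to a stable, opaque pseudonym,
# assigned in order of first appearance.
pseudonyms = {}

def mask_place(place):
    if place not in pseudonyms:
        pseudonyms[place] = f'Place-{len(pseudonyms) + 1}'
    return pseudonyms[place]

places = ['Springfield', 'Riverton', 'Springfield', 'Lakeside']
masked = [mask_place(p) for p in places]
```

The same idea works with Faker by memoizing one `fake.city()` result per distinct real name.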

6. Using Synthetic Data Generation#

Synthetic data generation involves creating new data points that mimic the statistical properties of the original dataset. This can help preserve the privacy of individuals while maintaining data utility for analysis.

import pandas as pd
import numpy as np

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Sample each column independently from a normal distribution with that
# column's mean and standard deviation (cross-column correlations are not preserved)
df['latitude_synthetic'] = np.random.normal(df['latitude'].mean(), df['latitude'].std(), size=len(df))
df['longitude_synthetic'] = np.random.normal(df['longitude'].mean(), df['longitude'].std(), size=len(df))
df['wealth_synthetic'] = np.random.normal(df['wealth'].mean(), df['wealth'].std(), size=len(df))

print("Data after generating synthetic values:")
print(df)
Data after generating synthetic values:
   latitude  longitude  wealth  latitude_synthetic  longitude_synthetic  \
0     34.05    -118.25   50000           34.697749           -80.257765   
1     36.16    -115.15  120000           40.586576          -122.255409   
2     40.71     -74.01  180000           35.411037          -108.299022   
3     37.77    -122.42   75000           35.861865          -122.470457   
4     34.05    -118.25  210000           35.498666          -127.272600   

   wealth_synthetic  
0     120537.264473  
1     172639.157841  
2     166803.227519  
3     102927.222418  
4     255479.085813  
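Because each column above is sampled independently, correlations between location and wealth are lost. A sketch (assuming the data are roughly normal) that instead draws from a multivariate normal fitted to the empirical mean vector and covariance matrix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000],
})

cols = ['latitude', 'longitude', 'wealth']
rng = np.random.default_rng(0)

# Fit the empirical mean vector and covariance matrix, then sample
# jointly so cross-column relationships survive in the synthetic data.
synthetic = rng.multivariate_normal(df[cols].mean(), df[cols].cov(), size=len(df))
df_synth = pd.DataFrame(synthetic, columns=cols)
```

This preserves pairwise covariances as well as the per-column moments, at the cost of assuming approximate normality.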

7. Combining Techniques#

Combining multiple anonymization techniques can provide a higher level of privacy protection. For example, spatial noise can be added along with binning of wealth values.

import pandas as pd
import numpy as np

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Function to add spatial noise
def add_spatial_noise(latitude, longitude, noise_level=0.01):
    # Adding noise to latitude and longitude
    noisy_latitude = latitude + np.random.normal(0, noise_level)
    noisy_longitude = longitude + np.random.normal(0, noise_level)
    return noisy_latitude, noisy_longitude

# Adding spatial noise and binning wealth values
df['latitude_combined'], df['longitude_combined'] = zip(*df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1))
df['wealth_combined'] = pd.cut(df['wealth'], bins=[0, 100000, 150000, 200000], labels=['Low', 'Medium', 'High'])

print("Data after combining anonymization techniques:")
print(df)
Data after combining anonymization techniques:
   latitude  longitude  wealth  latitude_combined  longitude_combined  \
0     34.05    -118.25   50000          34.071089         -118.253393   
1     36.16    -115.15  120000          36.166661         -115.160829   
2     40.71     -74.01  180000          40.720372          -73.994760   
3     37.77    -122.42   75000          37.772173         -122.411337   
4     34.05    -118.25  210000          34.064031         -118.240093   

  wealth_combined  
0             Low  
1          Medium  
2            High  
3             Low  
4             NaN  

8. Data Swapping#

Data swapping involves exchanging values of sensitive attributes (e.g., wealth) between different records. This technique helps anonymize the data while preserving the overall statistical distribution.

For example, we can randomly swap the wealth values between different locations in the dataset.

import numpy as np

# Function to swap values in the wealth column randomly
def swap_values(column):
    shuffled = column.sample(frac=1).values
    return shuffled

# Apply swapping to the 'wealth' column
df['wealth_swapped'] = swap_values(df['wealth'])

print("Data after swapping wealth values:")
df
Data after swapping wealth values:
   latitude  longitude  wealth  latitude_combined  longitude_combined wealth_combined  wealth_swapped
0     34.05    -118.25   50000          34.071089         -118.253393             Low          120000
1     36.16    -115.15  120000          36.166661         -115.160829          Medium           75000
2     40.71     -74.01  180000          40.720372          -73.994760            High          180000
3     37.77    -122.42   75000          37.772173         -122.411337             Low          210000
4     34.05    -118.25  210000          34.064031         -118.240093             NaN           50000

Data swapping can help protect sensitive information, but excessive swapping might distort the relationships between attributes. The amount of swapping should be chosen carefully to balance privacy and data utility.
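One way to control the amount of swapping is to shuffle only a random fraction of records, leaving the rest in place; a minimal sketch (`partial_swap` and its `frac` parameter are hypothetical names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'wealth': [50000, 120000, 180000, 75000, 210000]})

def partial_swap(column, frac=0.4, seed=0):
    # Shuffle values among a random subset of records; the multiset of
    # values (and hence the overall distribution) is unchanged.
    rng = np.random.default_rng(seed)
    values = column.to_numpy().copy()
    n_swap = max(2, int(round(frac * len(values))))  # need at least 2 to swap
    idx = rng.choice(len(values), size=n_swap, replace=False)
    values[idx] = values[rng.permutation(idx)]
    return values

df['wealth_swapped'] = partial_swap(df['wealth'])
```

Lower `frac` values keep more attribute relationships intact; higher values give stronger anonymization, matching the trade-off noted above.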