# Data Anonymization Techniques for Geospatial Data
This notebook demonstrates various techniques for anonymizing a dataset containing place names, latitude, longitude, and wealth values. Each technique has its own trade-offs between data privacy and utility, and the choice of method depends on the requirements for anonymization and analysis.
## 1. Aggregation/Generalization
Aggregation or generalization involves reducing the precision of geographic coordinates or grouping data into broader categories to make identification more difficult.
For example, latitude and longitude values can be rounded to reduce precision, making it harder to identify exact locations.
```python
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
    'wealth': [100000, 150000, 200000]
})

# Generalize coordinates by rounding to one decimal place
df['latitude_generalized'] = df['latitude'].round(1)
df['longitude_generalized'] = df['longitude'].round(1)

print("Data after generalization:")
df
```
Data after generalization:

|   | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized |
|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 |
| 2 | C | 40.7128 | -74.0060 | 200000 | 40.7 | -74.0 |
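Rounding to one decimal place snaps points to a grid of roughly 11 km in latitude. The same idea extends to grids of arbitrary cell size. A minimal sketch, where the 0.25-degree cell size is an arbitrary choice for illustration:

```python
import numpy as np

# Snap each coordinate to the centre of its grid cell; every point
# inside a cell maps to the same published location.
cell_size = 0.25  # illustrative value
df['latitude_grid'] = (np.floor(df['latitude'] / cell_size) + 0.5) * cell_size
df['longitude_grid'] = (np.floor(df['longitude'] / cell_size) + 0.5) * cell_size
```

Larger cells give stronger anonymity at the cost of spatial resolution.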
## 2. Adding Spatial Noise
Adding spatial noise involves randomly perturbing the latitude and longitude values. This can help anonymize the data while preserving overall geographic trends.
```python
import random

# Function to add uniform random noise to latitude and longitude
def add_spatial_noise(lat, lon, noise_level=0.01):
    noisy_lat = lat + random.uniform(-noise_level, noise_level)
    noisy_lon = lon + random.uniform(-noise_level, noise_level)
    return noisy_lat, noisy_lon

# Apply the spatial noise function row by row
df['latitude_noisy'], df['longitude_noisy'] = zip(
    *df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1)
)

print("Data after adding spatial noise:")
df
```
Data after adding spatial noise:

|   | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized | latitude_noisy | longitude_noisy |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 | 34.056263 | -118.240379 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 | 37.765504 | -122.412138 |
| 2 | C | 40.7128 | -74.0060 | 200000 | 40.7 | -74.0 | 40.709770 | -74.011202 |
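A noise_level of 0.01 degrees is roughly 1 km in latitude, but a degree of longitude shrinks with distance from the equator, so the same setting perturbs east-west positions less at high latitudes. If the privacy requirement is stated in metres, the offset can be drawn in metres and converted to degrees per point. A rough sketch using the common approximation of 111,320 m per degree of latitude (the function name and 500 m radius are illustrative):

```python
import math
import random

def add_noise_meters(lat, lon, radius_m=500):
    # Draw a random displacement of up to radius_m metres in a random
    # direction, then convert it to degrees at this latitude.
    dist = random.uniform(0, radius_m)
    bearing = random.uniform(0, 2 * math.pi)
    dlat = (dist * math.cos(bearing)) / 111_320
    dlon = (dist * math.sin(bearing)) / (111_320 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

df['latitude_noisy_m'], df['longitude_noisy_m'] = zip(
    *df.apply(lambda row: add_noise_meters(row['latitude'], row['longitude']), axis=1)
)
```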
## 3. K-Anonymity for Spatial Data
K-anonymity ensures that each data point is indistinguishable from at least k-1 other data points. For spatial data, this can be achieved by clustering nearby points and using the cluster centroids as the anonymized locations.
```python
# pip install scikit-mobility
import pandas as pd
from skmob import TrajDataFrame
from skmob.preprocessing import clustering

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.0522, 34.0523, 34.0524, 34.0525],
    'longitude': [-118.2437, -118.2438, -118.2439, -118.2440],
    'datetime': pd.date_range(start='2024-10-01', periods=4, freq='H')  # sample datetime values
}
df = pd.DataFrame(data)

# Convert the DataFrame to a TrajDataFrame for clustering
tdf = TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='datetime')

# Apply clustering with a 1 km radius and at least 2 points per cluster
clustered_tdf = clustering.cluster(tdf, cluster_radius_km=1, min_samples=2)

print("Data after applying k-anonymity clustering:")
print(clustered_tdf)
```
```
Data after applying k-anonymity clustering:
       lat       lng            datetime  cluster
0  34.0522 -118.2437 2024-10-01 00:00:00        0
1  34.0523 -118.2438 2024-10-01 01:00:00        0
2  34.0524 -118.2439 2024-10-01 02:00:00        0
3  34.0525 -118.2440 2024-10-01 03:00:00        0
```
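The call above only labels each point with a cluster id; the coordinates themselves are unchanged. To complete the anonymization, each point's coordinates can be replaced with its cluster centroid, so that all records in a cluster share one published location. A minimal sketch, assuming the clustered_tdf from the previous cell (skmob names the coordinate columns lat and lng, as in the output):

```python
# Replace each point's coordinates with the mean of its cluster, so
# records in the same cluster become spatially indistinguishable.
centroids = clustered_tdf.groupby('cluster')[['lat', 'lng']].transform('mean')
clustered_tdf['lat_anonymized'] = centroids['lat']
clustered_tdf['lng_anonymized'] = centroids['lng']
```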
## 4. Binning Wealth Values
To anonymize sensitive numerical data, such as wealth, binning can be used. This involves converting continuous values into discrete categories.
```python
import pandas as pd

# Sample data creation (replace this with your actual data)
data = {
    'wealth': [50000, 120000, 180000, 75000, 210000, 130000]
}
df = pd.DataFrame(data)

# Bin the wealth column into discrete ranges; the open-ended top edge
# keeps values above the last threshold from silently becoming NaN
df['wealth_binned'] = pd.cut(df['wealth'],
                             bins=[0, 100000, 150000, float('inf')],
                             labels=['Low', 'Medium', 'High'])

print("Data after binning wealth values:")
print(df)
```
```
Data after binning wealth values:
   wealth wealth_binned
0   50000           Low
1  120000        Medium
2  180000          High
3   75000           Low
4  210000          High
5  130000        Medium
```
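Fixed bin edges require knowing the range of the data in advance. An alternative is quantile binning with pd.qcut, which derives the edges from the data and places roughly equal numbers of records in each category, which also supports k-anonymity over the wealth attribute. A minimal sketch on the same column:

```python
# Quantile binning: edges are chosen so that each category holds
# roughly the same number of records.
df['wealth_quantile'] = pd.qcut(df['wealth'], q=3, labels=['Low', 'Medium', 'High'])
```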
## 5. Data Masking for Place Names
Data masking replaces identifiable values, such as place names, with broader categories or randomly generated names to protect privacy.
```python
# pip install faker
from faker import Faker

fake = Faker()

# Replace place names with synthetic city names
df['place_synthetic'] = [fake.city() for _ in range(len(df))]

print("Data after masking place names:")
df
```
Data after masking place names:

|   | wealth | wealth_binned | place_synthetic |
|---|---|---|---|
| 0 | 50000 | Low | Mitchellbury |
| 1 | 120000 | Medium | South Jason |
| 2 | 180000 | High | Perkinschester |
| 3 | 75000 | Low | East Steven |
| 4 | 210000 | High | Monicastad |
| 5 | 130000 | Medium | Andrewberg |
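Note that fake.city() draws a fresh name on every call, so two records for the same real place would receive different pseudonyms. When repeated places must stay linkable after masking, a fixed mapping can be built first. A minimal sketch with a hypothetical places column, since the sample frame here has none:

```python
# Hypothetical place names; in practice this would be the real column.
places = pd.Series(['A', 'B', 'A', 'C', 'B', 'A'])

# One synthetic name per distinct place, so repeats stay consistent.
mapping = {place: fake.city() for place in places.unique()}
places_masked = places.map(mapping)
```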
## 6. Using Synthetic Data Generation
Synthetic data generation involves creating new data points that mimic the statistical properties of the original dataset. This can help preserve the privacy of individuals while maintaining data utility for analysis.
```python
import pandas as pd
import numpy as np

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Generate synthetic values by sampling normal distributions fitted to
# each column's mean and standard deviation
df['latitude_synthetic'] = np.random.normal(df['latitude'].mean(), df['latitude'].std(), size=len(df))
df['longitude_synthetic'] = np.random.normal(df['longitude'].mean(), df['longitude'].std(), size=len(df))
df['wealth_synthetic'] = np.random.normal(df['wealth'].mean(), df['wealth'].std(), size=len(df))

print("Data after generating synthetic values:")
print(df)
```
```
Data after generating synthetic values:
   latitude  longitude  wealth  latitude_synthetic  longitude_synthetic  \
0     34.05    -118.25   50000           34.697749           -80.257765
1     36.16    -115.15  120000           40.586576          -122.255409
2     40.71     -74.01  180000           35.411037          -108.299022
3     37.77    -122.42   75000           35.861865          -122.470457
4     34.05    -118.25  210000           35.498666          -127.272600

   wealth_synthetic
0     120537.264473
1     172639.157841
2     166803.227519
3     102927.222418
4     255479.085813
```
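Sampling each column independently, as above, preserves the column means and standard deviations but destroys any correlation between location and wealth. If those relationships matter for analysis, the columns can instead be sampled jointly from a multivariate normal distribution fitted to the original data. A minimal sketch:

```python
# Fit a multivariate normal to the original columns and sample from it,
# preserving the covariance between location and wealth.
cols = ['latitude', 'longitude', 'wealth']
synthetic = np.random.multivariate_normal(
    mean=df[cols].mean(), cov=df[cols].cov(), size=len(df)
)
df_synthetic = pd.DataFrame(synthetic, columns=cols)
```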
## 7. Combining Techniques
Combining multiple anonymization techniques can provide a higher level of privacy protection. For example, spatial noise can be added along with binning of wealth values.
```python
import pandas as pd
import numpy as np

# Sample data creation (replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Function to add Gaussian spatial noise
def add_spatial_noise(latitude, longitude, noise_level=0.01):
    noisy_latitude = latitude + np.random.normal(0, noise_level)
    noisy_longitude = longitude + np.random.normal(0, noise_level)
    return noisy_latitude, noisy_longitude

# Add spatial noise and bin wealth values (open-ended top bin, as in section 4)
df['latitude_combined'], df['longitude_combined'] = zip(
    *df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1)
)
df['wealth_combined'] = pd.cut(df['wealth'],
                               bins=[0, 100000, 150000, float('inf')],
                               labels=['Low', 'Medium', 'High'])

print("Data after combining anonymization techniques:")
print(df)
```
```
Data after combining anonymization techniques:
   latitude  longitude  wealth  latitude_combined  longitude_combined  \
0     34.05    -118.25   50000          34.071089         -118.253393
1     36.16    -115.15  120000          36.166661         -115.160829
2     40.71     -74.01  180000          40.720372          -73.994760
3     37.77    -122.42   75000          37.772173         -122.411337
4     34.05    -118.25  210000          34.064031         -118.240093

  wealth_combined
0             Low
1          Medium
2            High
3             Low
4            High
```
## 8. Data Swapping
Data swapping involves exchanging values of sensitive attributes (e.g., wealth) between different records. This technique helps anonymize the data while preserving the overall statistical distribution.
For example, we can randomly swap the wealth values between different locations in the dataset.
```python
import numpy as np

# Function to shuffle all values of a column across records
def swap_values(column):
    shuffled = column.sample(frac=1).values
    return shuffled

# Apply swapping to the 'wealth' column
df['wealth_swapped'] = swap_values(df['wealth'])

print("Data after swapping wealth values:")
df
```
Data after swapping wealth values:

|   | latitude | longitude | wealth | latitude_combined | longitude_combined | wealth_combined | wealth_swapped |
|---|---|---|---|---|---|---|---|
| 0 | 34.05 | -118.25 | 50000 | 34.071089 | -118.253393 | Low | 120000 |
| 1 | 36.16 | -115.15 | 120000 | 36.166661 | -115.160829 | Medium | 75000 |
| 2 | 40.71 | -74.01 | 180000 | 40.720372 | -73.994760 | High | 180000 |
| 3 | 37.77 | -122.42 | 75000 | 37.772173 | -122.411337 | Low | 210000 |
| 4 | 34.05 | -118.25 | 210000 | 34.064031 | -118.240093 | High | 50000 |
Data swapping can help protect sensitive information, but excessive swapping might distort the relationships between attributes. The amount of swapping should be chosen carefully to balance privacy and data utility.
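One way to control that balance is to swap only a fraction of the records and leave the rest untouched. A minimal sketch, where the 40% swap fraction is an arbitrary choice for illustration:

```python
def partial_swap(column, swap_fraction=0.4, seed=None):
    # Shuffle values only within a randomly chosen subset of rows.
    rng = np.random.default_rng(seed)
    result = column.copy()
    n_swap = int(len(column) * swap_fraction)
    idx = rng.choice(column.index, size=n_swap, replace=False)
    result.loc[idx] = rng.permutation(result.loc[idx].values)
    return result

df['wealth_partially_swapped'] = partial_swap(df['wealth'])
```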