# Data Anonymization Techniques for Geospatial Data

This notebook demonstrates various techniques for anonymizing a dataset containing place names, latitude, longitude, and wealth values. Each technique has its own trade-offs between data privacy and utility, and the choice of method depends on the requirements for anonymization and analysis.

## 1. Aggregation/Generalization

Aggregation or generalization involves reducing the precision of geographic coordinates or grouping data into broader categories to make identification more difficult.

For example, rounding latitude and longitude to one decimal place limits precision to 0.1°, which is roughly 11 km of latitude, making it harder to identify exact locations.

In [1]:
import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'place': ['A', 'B', 'C'],
    'latitude': [34.0522, 37.7749, 40.7128],
    'longitude': [-118.2437, -122.4194, -74.0060],
    'wealth': [100000, 150000, 200000]
})

# Generalizing coordinates by rounding to one decimal place
df['latitude_generalized'] = df['latitude'].round(1)
df['longitude_generalized'] = df['longitude'].round(1)

print("Data after generalization:")
df

Data after generalization:


|  | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized |
|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 |
| 2 | C | 40.7128 | -74.006 | 200000 | 40.7 | -74.0 |
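
Rounding covers the "reduced precision" half of generalization; the "grouping into broader categories" half can be sketched by releasing one aggregated row per grid cell instead of one row per record. The sample records and the column names `lat_cell`, `lon_cell`, `n_records`, and `mean_wealth` are assumptions made for illustration, not part of the original dataset.

```python
import pandas as pd

# Hypothetical sample records; the first two fall in the same 0.1-degree cell
records = pd.DataFrame({
    'latitude': [34.0522, 34.0711, 37.7749, 40.7128],
    'longitude': [-118.2437, -118.2290, -122.4194, -74.0060],
    'wealth': [100000, 140000, 150000, 200000],
})

# Snap each point to a 0.1-degree grid cell
records['lat_cell'] = records['latitude'].round(1)
records['lon_cell'] = records['longitude'].round(1)

# Release one row per cell: the number of records and their mean wealth
cells = (
    records.groupby(['lat_cell', 'lon_cell'])
    .agg(n_records=('wealth', 'size'), mean_wealth=('wealth', 'mean'))
    .reset_index()
)

print(cells)
```

Cells holding a single record still expose that record's exact wealth, so the `n_records` column is worth checking before release.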


## 2. Adding Spatial Noise

Adding spatial noise involves randomly perturbing the latitude and longitude values. This can help anonymize the data while preserving overall geographic trends.

In [2]:
import random

# Function to add uniform noise, in degrees, to latitude and longitude
# (0.01 degrees of latitude is roughly 1.1 km)
def add_spatial_noise(lat, lon, noise_level=0.01):
    noisy_lat = lat + random.uniform(-noise_level, noise_level)
    noisy_lon = lon + random.uniform(-noise_level, noise_level)
    return noisy_lat, noisy_lon

# Applying the spatial noise function
df['latitude_noisy'], df['longitude_noisy'] = zip(*df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1))

print("Data after adding spatial noise:")
df

Data after adding spatial noise:


|  | place | latitude | longitude | wealth | latitude_generalized | longitude_generalized | latitude_noisy | longitude_noisy |
|---|---|---|---|---|---|---|---|---|
| 0 | A | 34.0522 | -118.2437 | 100000 | 34.1 | -118.2 | 34.045844 | -118.234098 |
| 1 | B | 37.7749 | -122.4194 | 150000 | 37.8 | -122.4 | 37.768044 | -122.417715 |
| 2 | C | 40.7128 | -74.006 | 200000 | 40.7 | -74.0 | 40.706003 | -74.013396 |
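
Noise expressed in degrees covers a different ground distance depending on latitude, because a degree of longitude shrinks toward the poles. A minimal sketch of specifying the displacement in metres instead is given here; the function name `add_noise_metres`, the 500 m radius, and the constant of about 111,320 m per degree of latitude are assumptions chosen for illustration.

```python
import numpy as np

def add_noise_metres(lat, lon, radius_m=500.0, rng=None):
    """Displace each point uniformly at random within a disc of radius_m metres."""
    rng = np.random.default_rng() if rng is None else rng
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # The square root of a uniform draw spreads points evenly over the disc area
    dist = radius_m * np.sqrt(rng.uniform(0.0, 1.0, size=lat.shape))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=lat.shape)
    metres_per_degree = 111_320.0  # approximate length of one degree of latitude
    dlat = dist * np.cos(angle) / metres_per_degree
    # One degree of longitude is shorter by a factor of cos(latitude)
    dlon = dist * np.sin(angle) / (metres_per_degree * np.cos(np.radians(lat)))
    return lat + dlat, lon + dlon

lat0 = np.array([34.0522, 37.7749, 40.7128])
lon0 = np.array([-118.2437, -122.4194, -74.0060])
noisy_lat, noisy_lon = add_noise_metres(lat0, lon0, radius_m=500.0,
                                        rng=np.random.default_rng(0))
print(noisy_lat, noisy_lon)
```

The flat-earth conversion is only an approximation and degrades near the poles; a projected coordinate system would be the more careful choice for large displacements.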


## 3. K-Anonymity for Spatial Data

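K-anonymity ensures that each data point is indistinguishable from at least \(k-1\) other data points. For spatial data, one approach is to cluster nearby points and publish each cluster's centroid in place of the exact coordinates; clusters with fewer than \(k\) points must be merged or suppressed for the guarantee to hold. The cell below performs only the clustering step.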

In [None]:
# pip install scikit-mobility

In [7]:
import pandas as pd
from skmob import TrajDataFrame
from skmob.preprocessing import clustering

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.0522, 34.0523, 34.0524, 34.0525],
    'longitude': [-118.2437, -118.2438, -118.2439, -118.2440],
    'datetime': pd.date_range(start='2024-10-01', periods=4, freq='H')  # Sample datetime values
}
df = pd.DataFrame(data)

# Convert the DataFrame to a TrajDataFrame for clustering
tdf = TrajDataFrame(df, latitude='latitude', longitude='longitude', datetime='datetime')

# Assign cluster labels using a 1 km radius (the coordinates themselves are not yet replaced)
clustered_tdf = clustering.cluster(tdf, cluster_radius_km=1, min_samples=2)

print("Data after applying k-anonymity clustering:")
print(clustered_tdf)


Data after applying k-anonymity clustering:
       lat       lng            datetime  cluster
0  34.0522 -118.2437 2024-10-01 00:00:00        0
1  34.0523 -118.2438 2024-10-01 01:00:00        0
2  34.0524 -118.2439 2024-10-01 02:00:00        0
3  34.0525 -118.2440 2024-10-01 03:00:00        0
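
The clustering output still contains the exact coordinates, so a further step is needed before release. The sketch below replaces each point with its cluster centroid and suppresses clusters with fewer than \(k\) members. It assumes a frame that already has `lat`, `lng`, and `cluster` columns like the output above; the function name `centroid_k_anonymize` and the extra isolated point are hypothetical.

```python
import pandas as pd

def centroid_k_anonymize(points, k, cluster_col='cluster'):
    """Replace coordinates with cluster centroids; drop clusters smaller than k."""
    # Size of the cluster each row belongs to
    sizes = points[cluster_col].map(points[cluster_col].value_counts())
    kept = points[sizes >= k].copy()
    # Every remaining row receives the mean coordinates of its cluster
    kept[['lat', 'lng']] = kept.groupby(cluster_col)[['lat', 'lng']].transform('mean')
    return kept

# Hypothetical labelled points: four close together, one isolated
labelled = pd.DataFrame({
    'lat': [34.0522, 34.0523, 34.0524, 34.0525, 40.7128],
    'lng': [-118.2437, -118.2438, -118.2439, -118.2440, -74.0060],
    'cluster': [0, 0, 0, 0, 1],
})

anonymized = centroid_k_anonymize(labelled, k=3)
print(anonymized)
```

Suppression is the simplest way to handle undersized clusters; merging them into a neighbouring cluster keeps more records at the cost of coarser locations.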


## 4. Binning Wealth Values

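To anonymize sensitive numerical data, such as wealth, binning can be used. This involves converting continuous values into discrete categories. With `pd.cut`, any value outside the outermost bin edges is assigned NaN, which is why 210000 is left unbinned in the output below.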

In [9]:
import pandas as pd

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'wealth': [50000, 120000, 180000, 75000, 210000, 130000]
}
df = pd.DataFrame(data)

# Binning the wealth column into discrete ranges
df['wealth_binned'] = pd.cut(df['wealth'], bins=[0, 100000, 150000, 200000], labels=['Low', 'Medium', 'High'])

print("Data after binning wealth values:")
print(df)


Data after binning wealth values:
   wealth wealth_binned
0   50000           Low
1  120000        Medium
2  180000          High
3   75000           Low
4  210000           NaN
5  130000        Medium
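
One way to avoid unbinned values is to make the top bin open-ended. This is a minimal sketch using the same sample values; the choice of `np.inf` as the upper edge is an assumption about the intended meaning of "High".

```python
import numpy as np
import pandas as pd

wealth = pd.Series([50000, 120000, 180000, 75000, 210000, 130000])

# An infinite upper edge means every positive value falls into some bin
binned = pd.cut(
    wealth,
    bins=[0, 100000, 150000, np.inf],
    labels=['Low', 'Medium', 'High'],
)

print(pd.DataFrame({'wealth': wealth, 'wealth_binned': binned}))
```

Bins are right-inclusive by default, so a value of exactly 100000 is labelled Low and a value of 0 would still fall outside the first bin.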


## 5. Data Masking for Place Names

Data masking replaces identifiable values, such as place names, with broader categories or randomly generated names to protect privacy.

In [11]:
#pip install faker


In [12]:
from faker import Faker

fake = Faker()
# Generate one synthetic city name per row (a fresh name is drawn for every row,
# so repeated places would not receive the same replacement)
df['place_synthetic'] = [fake.city() for _ in range(len(df))]

print("Data after masking place names:")
df

Data after masking place names:


|  | wealth | wealth_binned | place_synthetic |
|---|---|---|---|
| 0 | 50000 | Low | Frederickland |
| 1 | 120000 | Medium | East Cassandra |
| 2 | 180000 | High | Johnview |
| 3 | 75000 | Low | Smithberg |
| 4 | 210000 | NaN | North Markstad |
| 5 | 130000 | Medium | Christopherstad |
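
When the same place appears in several rows, analyses that group by place only survive masking if each place always maps to the same replacement. A sketch of such a consistent mapping follows, using neutral labels rather than Faker; the place names and the `Place_001` label format are hypothetical.

```python
import pandas as pd

# Hypothetical place names with repeats
places = pd.Series(['Springfield', 'Riverton', 'Springfield', 'Lakeside', 'Riverton'])

# pd.factorize assigns one integer code per distinct value, in order of first appearance
codes, originals = pd.factorize(places)
masked = pd.Series([f'Place_{code + 1:03d}' for code in codes], index=places.index)

# This lookup table reverses the masking, so it must be kept private
lookup = {name: f'Place_{i + 1:03d}' for i, name in enumerate(originals)}

print(masked.tolist())
```

Because codes follow the order of first appearance, shuffling the rows before factorizing may be worthwhile if row order itself is sensitive.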


## 6. Using Synthetic Data Generation

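Synthetic data generation involves creating new data points that mimic the statistical properties of the original dataset. This can help preserve the privacy of individuals while maintaining data utility for analysis. The example below samples each column independently from a normal distribution fitted to that column, so it reproduces only the per-column mean and standard deviation, not the relationships between latitude, longitude, and wealth.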

In [16]:
import pandas as pd
import numpy as np

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Generate synthetic latitude, longitude, and wealth values
df['latitude_synthetic'] = np.random.normal(df['latitude'].mean(), df['latitude'].std(), size=len(df))
df['longitude_synthetic'] = np.random.normal(df['longitude'].mean(), df['longitude'].std(), size=len(df))
df['wealth_synthetic'] = np.random.normal(df['wealth'].mean(), df['wealth'].std(), size=len(df))

print("Data after generating synthetic values:")
print(df)


Data after generating synthetic values:
   latitude  longitude  wealth  latitude_synthetic  longitude_synthetic  \
0     34.05    -118.25   50000           34.330063          -117.593164   
1     36.16    -115.15  120000           39.449315          -107.108012   
2     40.71     -74.01  180000           39.002840          -141.420115   
3     37.77    -122.42   75000           40.138219          -110.544666   
4     34.05    -118.25  210000           38.505326           -96.131315   

   wealth_synthetic  
0      45922.094082  
1      99388.697132  
2     153051.236689  
3     139571.492688  
4     324247.417247  
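
Sampling the columns jointly keeps their covariance as well as their means. The sketch below fits a multivariate normal to the same sample values; treating coordinates and wealth as jointly normal is a strong simplifying assumption, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000],
})

rng = np.random.default_rng(42)

# Fit a joint normal: the mean vector and the full covariance matrix
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()

# Draw as many synthetic rows as there are original rows
samples = rng.multivariate_normal(mean, cov, size=len(original))
synthetic = pd.DataFrame(samples, columns=original.columns)

print(synthetic)
```

With only five source rows the fitted covariance is itself very noisy, so this illustrates the mechanics rather than a realistic generator.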


## 7. Combining Techniques

Combining multiple anonymization techniques can provide a higher level of privacy protection. For example, spatial noise can be added along with binning of wealth values.

In [17]:
import pandas as pd
import numpy as np

# Sample data creation (Make sure to replace this with your actual data)
data = {
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000]
}
df = pd.DataFrame(data)

# Function to add spatial noise
def add_spatial_noise(latitude, longitude, noise_level=0.01):
    # Adding noise to latitude and longitude
    noisy_latitude = latitude + np.random.normal(0, noise_level)
    noisy_longitude = longitude + np.random.normal(0, noise_level)
    return noisy_latitude, noisy_longitude

# Adding spatial noise and binning wealth values
# (210000 lies above the last bin edge, so it becomes NaN)
df['latitude_combined'], df['longitude_combined'] = zip(*df.apply(lambda row: add_spatial_noise(row['latitude'], row['longitude']), axis=1))
df['wealth_combined'] = pd.cut(df['wealth'], bins=[0, 100000, 150000, 200000], labels=['Low', 'Medium', 'High'])

print("Data after combining anonymization techniques:")
print(df)


Data after combining anonymization techniques:
   latitude  longitude  wealth  latitude_combined  longitude_combined  \
0     34.05    -118.25   50000          34.053350         -118.246036   
1     36.16    -115.15  120000          36.140220         -115.146150   
2     40.71     -74.01  180000          40.700640          -74.006085   
3     37.77    -122.42   75000          37.772943         -122.415578   
4     34.05    -118.25  210000          34.034992         -118.249693   

  wealth_combined  
0             Low  
1          Medium  
2            High  
3             Low  
4             NaN  
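
The combined frame above still carries the raw `latitude`, `longitude`, and `wealth` columns next to the anonymized ones. A sketch of a release step that returns only the anonymized columns is shown here; the function name `anonymize`, the column name `wealth_band`, and the open-ended top bin are assumptions rather than part of the original notebook.

```python
import numpy as np
import pandas as pd

def anonymize(frame, noise_level=0.01, seed=None):
    """Return a new frame holding only perturbed coordinates and binned wealth."""
    rng = np.random.default_rng(seed)
    out = pd.DataFrame(index=frame.index)
    # Gaussian noise, in degrees, drawn independently for every row
    out['latitude'] = frame['latitude'] + rng.normal(0, noise_level, size=len(frame))
    out['longitude'] = frame['longitude'] + rng.normal(0, noise_level, size=len(frame))
    # Open-ended top bin so that no value is left unbinned
    out['wealth_band'] = pd.cut(
        frame['wealth'],
        bins=[0, 100000, 150000, np.inf],
        labels=['Low', 'Medium', 'High'],
    )
    return out

raw = pd.DataFrame({
    'latitude': [34.05, 36.16, 40.71, 37.77, 34.05],
    'longitude': [-118.25, -115.15, -74.01, -122.42, -118.25],
    'wealth': [50000, 120000, 180000, 75000, 210000],
})

released = anonymize(raw, seed=0)
print(released)
```

Building the output in a fresh frame, rather than adding columns to the input, makes it harder to publish the raw values by accident.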


## 8. Data Swapping

Data swapping involves exchanging values of sensitive attributes (e.g., wealth) between different records. Because the same set of values is retained, the distribution of the swapped attribute is preserved exactly, while the link between each value and its record is broken.

For example, we can randomly swap the wealth values between different locations in the dataset.

In [15]:
import numpy as np

# Function to swap values in the wealth column randomly
def swap_values(column):
    shuffled = column.sample(frac=1).values
    return shuffled

# Apply swapping to the 'wealth' column
# (as the output shows, this cell was run on the DataFrame from sections 4 and 5)
df['wealth_swapped'] = swap_values(df['wealth'])

print("Data after swapping wealth values:")
df

Data after swapping wealth values:


|  | wealth | wealth_binned | place_synthetic | wealth_swapped |
|---|---|---|---|---|
| 0 | 50000 | Low | Frederickland | 120000 |
| 1 | 120000 | Medium | East Cassandra | 50000 |
| 2 | 180000 | High | Johnview | 180000 |
| 3 | 75000 | Low | Smithberg | 75000 |
| 4 | 210000 | NaN | North Markstad | 130000 |
| 5 | 130000 | Medium | Christopherstad | 210000 |


Data swapping can help protect sensitive information, but excessive swapping might distort the relationships between attributes. The amount of swapping should be chosen carefully to balance privacy and data utility.
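
One way to control the amount of swapping is to shuffle only a chosen fraction of the rows. This is a minimal sketch; the function name `swap_fraction`, its `fraction` parameter, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

def swap_fraction(column, fraction=0.5, seed=None):
    """Shuffle the values of a randomly chosen fraction of rows among themselves."""
    rng = np.random.default_rng(seed)
    values = column.to_numpy().copy()
    n_swap = int(round(fraction * len(values)))
    # Pick which row positions take part in the swap
    chosen = rng.choice(len(values), size=n_swap, replace=False)
    # Reassign the chosen values in a random order; all other rows stay untouched
    values[chosen] = values[rng.permutation(chosen)]
    return pd.Series(values, index=column.index, name=column.name)

wealth = pd.Series([50000, 120000, 180000, 75000, 210000, 130000], name='wealth')
swapped = swap_fraction(wealth, fraction=0.5, seed=1)
print(swapped.tolist())
```

A chosen row can receive its own value back by chance, so the number of rows that actually change may be smaller than the requested fraction.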