import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")
# Core image processing
from PIL import Image
import io
# Deep learning
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models
# Visualization
import plotnine as p9
from mizani.formatters import percent_format, comma_format
import matplotlib.pyplot as plt
# Statistical analysis
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.panel import PanelOLS
45 Image and Visual Data in Finance
The previous chapter demonstrated how unstructured text (e.g., earnings reports, news articles, business descriptions) can be transformed into structured signals for financial analysis. This chapter extends the alternative data toolkit to a second modality: images. Visual data is abundant in financial contexts yet systematically underexploited. Satellite photographs reveal real economic activity (e.g., parking lot occupancy at retail locations, construction progress at industrial sites, nighttime luminosity as a proxy for regional GDP, ship traffic at port terminals, crop health across agricultural zones). Corporate documents arrive as scanned PDFs whose tables and figures resist standard text extraction. Financial charts encode information that analysts interpret visually but that systematic strategies cannot consume without digitization. And the visual content of social media, advertising, and product imagery carries sentiment and brand signals that complement textual analysis.
The core challenge is representational: an image is a three-dimensional tensor of pixel intensities with no inherent semantic structure. Converting this raw array into a financial signal (i.e., a number that predicts returns, measures risk, or proxies for economic activity) requires either hand-crafted feature engineering or learned representations via deep convolutional neural networks (CNNs) and vision transformers (ViTs). This chapter covers both approaches.
We organize the material around five application domains, each with distinct data sources, modeling requirements, and economic motivations. First, satellite and geospatial imagery for nowcasting economic activity. Second, document image analysis for extracting structured data from Vietnamese financial filings. Third, chart and figure digitization for systematic backtesting. Fourth, visual sentiment analysis from social and news media. Fifth, multimodal fusion, combining image and text signals into joint predictive models.
Vietnamese markets present particular opportunities in this space. Satellite imagery is especially informative in an economy with large agricultural and manufacturing sectors where ground-truth data arrives with significant lags. Vietnamese financial filings are often distributed as scanned images rather than machine-readable formats, making document AI essential rather than optional. And the rapid urbanization visible in construction and infrastructure imagery provides high-frequency proxies for macroeconomic momentum that official statistics cannot match.
45.1 Foundations: From Pixels to Financial Signals
45.1.1 Image Representation
A digital image is a function \(I: \{1, \ldots, H\} \times \{1, \ldots, W\} \times \{1, \ldots, C\} \rightarrow \{0, \ldots, 255\}\) mapping spatial coordinates and color channels to integer intensity values; after scaling to floats, an RGB image of height \(H\) and width \(W\) is represented as a tensor \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\). A single \(224 \times 224\) RGB image (i.e., the standard input size for modern CNNs) contains \(224 \times 224 \times 3 = 150{,}528\) dimensions. This extreme dimensionality, combined with spatial structure (nearby pixels are correlated), makes images fundamentally different from tabular financial data and demands specialized architectures.
The key insight of convolutional neural networks is parameter sharing via local filters. A convolutional layer applies a kernel \(\mathbf{K} \in \mathbb{R}^{k \times k \times C_{\text{in}}}\) to produce a feature map:
\[ (\mathbf{I} * \mathbf{K})(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=1}^{C_{\text{in}}} I(i+m, j+n, c) \cdot K(m, n, c) \tag{45.1}\]
By stacking convolutional layers with nonlinearities and pooling operations, the network builds a hierarchy of representations: early layers detect edges and textures; middle layers detect parts and patterns; deep layers detect objects and scenes. The final layer output \(\mathbf{z} \in \mathbb{R}^{d}\) (with \(d\) typically 512-2048) is a compact representation of the image’s semantic content, which can be used directly as a feature vector for financial prediction.
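Equation 45.1 can be verified with a naive NumPy implementation. The sketch below (the helper name `conv2d_valid` is ours, for illustration) slides the kernel over all valid positions and sums over space and channels; production frameworks use optimized versions of exactly this operation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation of Eq. (45.1): slide a k x k x C
    kernel over an H x W x C image, summing over space and channels."""
    H, W, C = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            # Elementwise product over the k x k x C window, then sum
            out[i, j] = np.sum(image[i:i + k, j:j + k, :] * kernel)
    return out

# A 3x3 averaging kernel applied to a 5x5 single-channel image of ones:
# every output entry averages nine ones, so the feature map is all ones
feature_map = conv2d_valid(np.ones((5, 5, 1)), np.ones((3, 3, 1)) / 9)
```

Note the output spatial size shrinks to \((H-k+1) \times (W-k+1)\); real networks pad the input when they want to preserve resolution.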
Financial images span a wide range of resolutions and modalities:
| Source | Resolution | Channels | Typical Size | Update Frequency |
|---|---|---|---|---|
| Sentinel-2 satellite | 10m/pixel | 13 bands | 10,980 × 10,980 | 5 days |
| Planet Labs | 3m/pixel | 4 bands | 4,000 × 4,000 | Daily |
| VIIRS nightlights | 500m/pixel | 1 (DNB) | 3,000 × 1,800 | Monthly composite |
| Annual report scan | 300 DPI | 3 (RGB) | 2,480 × 3,508 | Annual |
| CEO photograph | Varies | 3 (RGB) | 500 × 500 | Annual |
| News photograph | Varies | 3 (RGB) | 800 × 600 | Real-time |
| Financial chart | Varies | 3 (RGB) | 1,000 × 600 | Real-time |
45.1.2 Transfer Learning for Finance
Training a CNN from scratch requires millions of labeled images, which is far more than any financial application can provide. Transfer learning solves this by using networks pre-trained on ImageNet (1.2 million images, 1,000 classes) as feature extractors. The pre-trained network has already learned generic visual representations (edges, textures, shapes, objects); we simply replace the final classification layer with a task-specific head.
Formally, let \(f_{\boldsymbol{\theta}}(\mathbf{I})\) denote a pre-trained network with parameters \(\boldsymbol{\theta}\) partitioned into feature extractor \(\boldsymbol{\theta}_{\text{feat}}\) and classifier \(\boldsymbol{\theta}_{\text{cls}}\). For financial applications, two strategies are standard:
- Feature extraction: Freeze \(\boldsymbol{\theta}_{\text{feat}}\), extract \(\mathbf{z} = f_{\boldsymbol{\theta}_{\text{feat}}}(\mathbf{I})\), and train a simple model (linear regression, gradient boosting) on \(\mathbf{z}\).
- Fine-tuning: Initialize from \(\boldsymbol{\theta}\) and train all parameters on the financial task with a small learning rate to avoid catastrophic forgetting.
The key architectural families are:
ResNet [He et al. (2016)]. Residual connections (\(y = F(x) + x\)) enable training of very deep networks (50-152 layers); the skip connection mitigates the vanishing gradient problem. ResNet-50 produces a 2,048-dimensional feature vector from the penultimate layer.
EfficientNet [Tan and Le (2019)]. Compound scaling of depth, width, and resolution simultaneously. EfficientNet-B0 achieves ResNet-50 accuracy with 5.3M parameters (vs. 25.6M), making it practical for processing thousands of satellite tiles.
Vision Transformer (ViT) [Dosovitskiy et al. (2020)]. Treats an image as a sequence of \(16 \times 16\) patches, processes them through a standard Transformer encoder. ViT-B/16 produces a 768-dimensional embedding. Particularly effective for document images where spatial relationships between elements (tables, headers, text blocks) matter.
# DataCore.vn API
from datacore import DataCore
dc = DataCore()
def build_feature_extractor(model_name="resnet50", device="cpu"):
"""
Build a pre-trained CNN feature extractor.
Parameters
----------
model_name : str
One of 'resnet50', 'efficientnet_b0', 'vit_b_16'.
device : str
'cpu' or 'cuda'.
    Returns
    -------
    model : nn.Module
        Feature extraction model (classification head removed).
    transform : transforms.Compose
        Image preprocessing pipeline.
    dim : int
        Dimensionality of the extracted feature vector.
    """
if model_name == "resnet50":
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.fc = nn.Identity() # Remove classification head
dim = 2048
elif model_name == "efficientnet_b0":
weights = models.EfficientNet_B0_Weights.IMAGENET1K_V1
model = models.efficientnet_b0(weights=weights)
model.classifier = nn.Identity()
dim = 1280
elif model_name == "vit_b_16":
weights = models.ViT_B_16_Weights.IMAGENET1K_V1
model = models.vit_b_16(weights=weights)
model.heads = nn.Identity()
        dim = 768
    else:
        raise ValueError(f"Unknown model_name: {model_name}")
model = model.to(device).eval()
transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]
)
])
return model, transform, dim
def extract_features(image_paths, model, transform, device="cpu",
batch_size=32):
"""
Extract deep features from a list of images.
Parameters
----------
image_paths : list
Paths to image files.
model : nn.Module
Feature extraction model.
    transform : transforms.Compose
        Preprocessing pipeline.
    device : str
        'cpu' or 'cuda'.
    batch_size : int
        Number of images per forward pass.
Returns
-------
np.ndarray : Feature matrix (n_images x feature_dim).
"""
features = []
for i in range(0, len(image_paths), batch_size):
batch_paths = image_paths[i:i + batch_size]
batch_tensors = []
for path in batch_paths:
try:
img = Image.open(path).convert("RGB")
tensor = transform(img)
batch_tensors.append(tensor)
            except Exception:
                # Unreadable file: substitute a blank placeholder image
                batch_tensors.append(torch.zeros(3, 224, 224))
batch = torch.stack(batch_tensors).to(device)
with torch.no_grad():
batch_features = model(batch).cpu().numpy()
features.append(batch_features)
    return np.vstack(features)
45.2 Satellite and Geospatial Imagery
45.2.1 Economic Activity from Space
Satellite imagery provides high-frequency, spatially granular measurements of economic activity that are independent of, and often lead, official statistics. The foundational work of Henderson, Storeygard, and Weil (2012) demonstrated that nighttime luminosity, measured by the Defense Meteorological Satellite Program (DMSP), is a reliable proxy for GDP, particularly in countries where official statistics are noisy or delayed. Donaldson (2018) uses satellite-derived agricultural output measures to study the welfare gains from railroads in colonial India. Jean et al. (2016) combine daytime satellite imagery with CNNs to predict poverty from space with \(R^2 > 0.7\).
For financial applications, the key insight is that satellite data arrives faster than corporate earnings or government statistics. A retailer’s quarterly revenue is reported 4-8 weeks after the quarter ends; satellite imagery of its parking lots is available within days. This temporal advantage creates a natural use case for nowcasting (i.e., estimating current economic conditions before official data arrives) and for constructing trading signals based on information that is public but costly to process.
45.2.2 Application 1: Nighttime Luminosity and Provincial GDP
Vietnam’s General Statistics Office (GSO) publishes provincial GDP with a lag of several months. Nighttime luminosity from the VIIRS (Visible Infrared Imaging Radiometer Suite) sensor provides a near-real-time alternative. We construct a firm-level exposure measure by linking each listed firm’s registered location to the luminosity of its province.
# Load nighttime luminosity data (VIIRS monthly composites)
# Source: Earth Observation Group (EOG) / NOAA
nightlights = dc.get_nightlight_data(
start_date="2014-01-01",
end_date="2024-12-31",
resolution="province"
)
# Load firm location data
firm_locations = dc.get_firm_locations()
# Provincial GDP from GSO
provincial_gdp = dc.get_provincial_gdp(
start_date="2014-01-01",
end_date="2024-12-31"
)
print(f"Nightlight observations: {len(nightlights)}")
print(f"Provinces covered: {nightlights['province'].nunique()}")
print(f"Firms with location: {firm_locations['ticker'].nunique()}")
# Validate: does nightlight predict provincial GDP?
nl_gdp = nightlights.merge(
provincial_gdp,
on=["province", "year", "quarter"],
how="inner"
)
# Log-log specification (standard in the literature)
nl_gdp["ln_luminosity"] = np.log(nl_gdp["mean_radiance"].clip(lower=0.01))
nl_gdp["ln_gdp"] = np.log(nl_gdp["provincial_gdp"].clip(lower=1))
# Cross-sectional regression by year
validation_results = []
for year in nl_gdp["year"].unique():
subset = nl_gdp[nl_gdp["year"] == year]
if len(subset) < 20:
continue
model = sm.OLS(
subset["ln_gdp"],
sm.add_constant(subset["ln_luminosity"])
).fit()
validation_results.append({
"year": year,
"beta": model.params.iloc[1],
"r_squared": model.rsquared,
"n_provinces": int(model.nobs)
})
validation_df = pd.DataFrame(validation_results)
print(f"Avg R² (ln GDP ~ ln Luminosity): {validation_df['r_squared'].mean():.3f}")
(
p9.ggplot(nl_gdp[nl_gdp["year"] == 2023],
p9.aes(x="ln_luminosity", y="ln_gdp"))
+ p9.geom_point(color="#2E5090", alpha=0.6, size=2)
+ p9.geom_smooth(method="lm", color="#C0392B", se=True, size=0.8)
+ p9.labs(
x="ln(Mean Nighttime Radiance)",
y="ln(Provincial GDP)",
title="Nighttime Luminosity vs. Provincial GDP (2023)"
)
+ p9.theme_minimal()
+ p9.theme(figure_size=(10, 6))
)
# Construct firm-level nightlight signal
# Logic: firms in provinces with accelerating luminosity
# are experiencing positive local economic conditions
nl_growth = nightlights.copy().sort_values(["province", "year", "quarter"])
# Year-over-year luminosity growth by province
nl_growth["ln_radiance"] = np.log(nl_growth["mean_radiance"].clip(lower=0.01))
nl_growth["ln_radiance_lag4"] = nl_growth.groupby(
"province"
)["ln_radiance"].shift(4)
nl_growth["nl_growth"] = nl_growth["ln_radiance"] - nl_growth["ln_radiance_lag4"]
# Merge with firms via province
firm_nl = firm_locations[["ticker", "province"]].merge(
nl_growth[["province", "year", "quarter", "nl_growth"]],
on="province",
how="inner"
)
# Merge with stock returns
monthly_returns = dc.get_monthly_returns(
start_date="2014-01-01",
end_date="2024-12-31"
)
monthly_returns["year"] = monthly_returns["date"].dt.year
monthly_returns["quarter"] = monthly_returns["date"].dt.quarter
returns_with_nl = monthly_returns.merge(
firm_nl,
on=["ticker", "year", "quarter"],
how="inner"
)
# Portfolio sort: quintiles on provincial nightlight growth
# labels=False sidesteps the label/bin-count mismatch that pd.qcut
# raises when duplicates="drop" removes bins; +1 maps bins to 1..5
returns_with_nl["nl_quintile"] = returns_with_nl.groupby("date")[
    "nl_growth"
].transform(lambda x: pd.qcut(x, 5, labels=False, duplicates="drop") + 1)
nl_port_returns = (
returns_with_nl.groupby(["date", "nl_quintile"])
.agg(port_ret=("ret", "mean"))
.reset_index()
)
nl_wide = nl_port_returns.pivot(
index="date", columns="nl_quintile", values="port_ret"
)
nl_wide["L-S"] = nl_wide[5] - nl_wide[1]  # High NL growth - Low
nl_summary = nl_wide.describe().T[["mean", "std"]].copy()
nl_summary["mean_ann"] = nl_summary["mean"] * 12
nl_summary["sharpe"] = (
nl_summary["mean_ann"] / (nl_summary["std"] * np.sqrt(12))
)
for col in nl_wide.columns:
t_stat = nl_wide[col].mean() / (
nl_wide[col].std() / np.sqrt(len(nl_wide.dropna()))
)
nl_summary.loc[col, "t_stat"] = t_stat
nl_summary = nl_summary[["mean_ann", "sharpe", "t_stat"]].round(4)
nl_summary.columns = ["Ann. Return", "Sharpe", "t-stat"]
nl_summary
45.2.3 Application 2: Satellite Imagery for Sector Nowcasting
Beyond luminosity, daytime satellite imagery provides sector-specific signals. We implement three channels relevant to the Vietnamese economy.
Port activity. Vietnam is a major export-oriented economy. Satellite imagery of container ports (Cát Lái, Hải Phòng) captures trade throughput before customs statistics are released. Ship detection algorithms applied to synthetic aperture radar (SAR) imagery count vessels and estimate cargo volumes.
Construction progress. Real estate and construction constitute a significant fraction of Vietnamese GDP and market capitalization. Change detection algorithms applied to high-resolution optical imagery identify construction starts, completion rates, and land-use conversion.
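In its simplest form, change detection is thresholded differencing between two co-registered acquisitions. The toy sketch below (the helper name `changed_fraction` and the threshold are illustrative assumptions; production systems add radiometric correction and learned change models) conveys the idea on grayscale tiles scaled to \([0, 1]\):

```python
import numpy as np

def changed_fraction(img_t0, img_t1, threshold=0.2):
    """Fraction of pixels whose absolute intensity change exceeds a
    threshold -- a crude proxy for construction activity between two
    co-registered grayscale acquisitions scaled to [0, 1]."""
    diff = np.abs(np.asarray(img_t1, float) - np.asarray(img_t0, float))
    return float((diff > threshold).mean())

# Toy example: a 10x10 site where a 4x4 block is newly built up
before = np.zeros((10, 10))
after = before.copy()
after[3:7, 3:7] = 0.8  # new structure raises reflectance
print(changed_fraction(before, after))  # 16 of 100 pixels -> 0.16
```

Aggregating this fraction over a developer's project sites, quarter by quarter, yields a construction-progress time series that can be linked to the firm's returns in the same way as the nightlight signal above.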
Agricultural monitoring. Vietnam is a leading exporter of rice, coffee, rubber, and seafood. The Normalized Difference Vegetation Index (NDVI), computed from multispectral satellite data, provides crop health assessments:
\[ \text{NDVI} = \frac{\rho_{\text{NIR}} - \rho_{\text{Red}}}{\rho_{\text{NIR}} + \rho_{\text{Red}}} \tag{45.2}\]
where \(\rho_{\text{NIR}}\) and \(\rho_{\text{Red}}\) are reflectance in the near-infrared and red bands. NDVI ranges from \(-1\) to \(+1\), with values above 0.3 indicating healthy vegetation. Deviations from seasonal norms proxy for crop yield surprises.
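Equation 45.2 is a one-liner on reflectance arrays; a small epsilon guards against division by zero over water or shadow pixels. The helper below (the name `ndvi` and the epsilon are our illustrative choices) applies it elementwise:

```python
import numpy as np

def ndvi(nir, red, eps=1e-9):
    """Normalized Difference Vegetation Index per Eq. (45.2)."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    # eps keeps the ratio defined where both bands are ~0 (water, shadow)
    return (nir - red) / (nir + red + eps)

# Healthy vegetation reflects strongly in NIR relative to red
print(round(float(ndvi(0.50, 0.10)), 3))  # dense canopy -> 0.667
print(round(float(ndvi(0.12, 0.10)), 3))  # bare soil -> 0.091
```

Applied to the NIR and red bands of a multispectral tile, the same function returns a per-pixel NDVI map that is then averaged over a region to produce series like `mean_ndvi` below.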
# Load NDVI data for Vietnamese agricultural regions
# Source: MODIS/Terra (MOD13Q1, 250m resolution, 16-day composites)
ndvi_data = dc.get_ndvi_data(
start_date="2014-01-01",
end_date="2024-12-31",
regions=["mekong_delta", "central_highlands",
"red_river_delta", "southeast"]
)
# Compute NDVI anomaly: deviation from 5-year seasonal average
ndvi_data["month"] = ndvi_data["date"].dt.month
seasonal_mean = (
    ndvi_data.groupby(["region", "month"])
    ["mean_ndvi"].transform(
        # MOD13Q1 delivers ~2 composites per calendar month, so ~10
        # same-month observations span the trailing five years
        lambda x: x.rolling(10, min_periods=2).mean()
    )
)
ndvi_data["ndvi_anomaly"] = ndvi_data["mean_ndvi"] - seasonal_mean
# Agricultural sector firms
agri_firms = dc.get_firms_by_sector(sector="agriculture")
# Link NDVI anomaly to agricultural firm returns
agri_returns = monthly_returns[
monthly_returns["ticker"].isin(agri_firms["ticker"])
].copy()
agri_returns["month"] = agri_returns["date"].dt.month
agri_returns["year"] = agri_returns["date"].dt.year
# Regional NDVI aggregation (Mekong Delta for rice firms, etc.)
mekong_ndvi = ndvi_data[ndvi_data["region"] == "mekong_delta"].copy()
mekong_ndvi["year"] = mekong_ndvi["date"].dt.year
mekong_ndvi["month"] = mekong_ndvi["date"].dt.month
mekong_monthly = (
mekong_ndvi.groupby(["year", "month"])
.agg(ndvi_anomaly=("ndvi_anomaly", "mean"))
.reset_index()
)
ndvi_plot = ndvi_data[ndvi_data["region"] == "mekong_delta"].copy()
(
p9.ggplot(ndvi_plot, p9.aes(x="date", y="ndvi_anomaly"))
+ p9.geom_line(color="#27AE60", alpha=0.5, size=0.4)
+ p9.geom_smooth(method="lowess", color="#2E5090", size=1, se=False)
+ p9.geom_hline(yintercept=0, linetype="dashed", color="gray")
+ p9.labs(
x="",
y="NDVI Anomaly",
title="Mekong Delta Vegetation Health: Deviation from Seasonal Norm"
)
+ p9.theme_minimal()
+ p9.theme(figure_size=(12, 5))
)
# Panel regression: agricultural firm returns on NDVI anomaly
agri_panel = agri_returns.merge(
mekong_monthly,
on=["year", "month"],
how="inner"
)
# Lagged NDVI anomaly (one month)
agri_panel = agri_panel.sort_values(["ticker", "date"])
agri_panel["ndvi_lag1"] = agri_panel.groupby(
"ticker"
)["ndvi_anomaly"].shift(1)
agri_clean = agri_panel.dropna(
subset=["ret", "ndvi_lag1"]
).set_index(["ticker", "date"])
model_ndvi = PanelOLS(
agri_clean["ret"],
agri_clean[["ndvi_lag1"]],
entity_effects=True,
time_effects=True,
check_rank=False
).fit(cov_type="clustered", cluster_entity=True)
agri_clean = agri_clean.reset_index()
print(f"NDVI → Agricultural Returns:")
print(f" β(NDVI_lag): {model_ndvi.params['ndvi_lag1']:.4f}")
print(f" t-stat: {model_ndvi.tstats['ndvi_lag1']:.3f}")
print(f" R² (within): {model_ndvi.rsquared_within:.4f}")
45.2.4 Satellite Feature Extraction with CNNs
For raw satellite imagery (rather than pre-computed indices like NDVI), we use transfer learning from CNNs to extract spatial features. The approach follows Jean et al. (2016): use a CNN pre-trained on ImageNet to extract feature vectors from satellite tiles, then regress economic outcomes on these features.
def satellite_feature_pipeline(image_dir, model_name="resnet50"):
"""
Extract CNN features from satellite image tiles.
Parameters
----------
image_dir : str or Path
Directory containing satellite tiles (PNG/TIFF).
model_name : str
Pre-trained model to use.
Returns
-------
DataFrame : image_id, feature vector columns.
"""
image_dir = Path(image_dir)
image_paths = sorted(image_dir.glob("*.png")) + sorted(
image_dir.glob("*.tif")
)
if not image_paths:
print("No images found.")
return pd.DataFrame()
device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform, dim = build_feature_extractor(model_name, device)
features = extract_features(image_paths, model, transform, device)
# Create DataFrame
feature_cols = [f"feat_{i}" for i in range(dim)]
df = pd.DataFrame(features, columns=feature_cols)
df["image_id"] = [p.stem for p in image_paths]
return df
def predict_economic_activity(features_df, labels_df, label_col,
n_components=50):
"""
Predict economic activity from satellite image features.
Uses PCA for dimensionality reduction, then ridge regression.
Parameters
----------
features_df : DataFrame
CNN features with image_id.
labels_df : DataFrame
Economic outcomes with image_id.
label_col : str
Target variable column name.
n_components : int
PCA components to retain.
Returns
-------
dict : R², coefficients, cross-validated performance.
"""
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
merged = features_df.merge(labels_df, on="image_id")
feature_cols = [c for c in features_df.columns if c.startswith("feat_")]
X = merged[feature_cols].values
y = merged[label_col].values
# PCA
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X)
var_explained = pca.explained_variance_ratio_.sum()
# Ridge regression with cross-validation
ridge = RidgeCV(alphas=np.logspace(-3, 3, 20), cv=5)
cv_scores = cross_val_score(ridge, X_pca, y, cv=5, scoring="r2")
ridge.fit(X_pca, y)
return {
"r2_cv_mean": cv_scores.mean(),
"r2_cv_std": cv_scores.std(),
"r2_train": ridge.score(X_pca, y),
"pca_var_explained": var_explained,
"optimal_alpha": ridge.alpha_,
"n_images": len(merged)
    }
45.3 Document Image Analysis
45.3.1 The Vietnamese Filing Problem
A substantial fraction of Vietnamese corporate disclosures (e.g., annual reports, financial statements, board resolutions, shareholder meeting minutes) are distributed as scanned PDF images rather than machine-readable text. This creates a data extraction bottleneck: the information exists but is trapped in pixel format. Unlike filings in more developed markets (where XBRL mandates ensure machine readability), Vietnamese filings require Optical Character Recognition (OCR) and layout analysis before any quantitative analysis can begin.
The document AI pipeline for Vietnamese financial filings involves four stages:
- Page classification: Identify which pages contain financial statements, management discussion, audit opinions, etc.
- Layout analysis: Detect the spatial structure such as headers, paragraphs, tables, figures, captions.
- OCR: Convert image regions to text, using Vietnamese-optimized models.
- Structured extraction: Parse the recognized text into structured data (e.g., revenue figures, balance sheet items).
45.3.2 OCR for Vietnamese Financial Documents
Standard OCR engines (Tesseract, Google Cloud Vision) struggle with Vietnamese financial documents due to the combination of Vietnamese diacritics (ă, ơ, ư, ê, etc.), mixed Vietnamese-English content, and complex table layouts. We implement a pipeline using PaddleOCR (which has strong CJK and Southeast Asian language support) and VietOCR (a Vietnamese-specific model based on the transformer architecture of Baek et al. (2019)).
def ocr_financial_document(pdf_path, language="vi",
engine="paddleocr"):
"""
OCR a Vietnamese financial document (scanned PDF).
Parameters
----------
pdf_path : str
Path to PDF file.
language : str
Language code.
engine : str
'paddleocr' or 'vietocr'.
Returns
-------
list[dict] : Per-page OCR results with bounding boxes.
"""
from pdf2image import convert_from_path
# Convert PDF pages to images
pages = convert_from_path(pdf_path, dpi=300)
results = []
if engine == "paddleocr":
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang="vi", use_gpu=False)
for page_num, page_img in enumerate(pages):
# Convert PIL to numpy
img_array = np.array(page_img)
ocr_result = ocr.ocr(img_array, cls=True)
page_texts = []
            for line in ocr_result[0] or []:  # guard: empty pages return None
bbox, (text, confidence) = line
page_texts.append({
"text": text,
"confidence": confidence,
"bbox": bbox,
"page": page_num + 1
})
results.extend(page_texts)
    else:
        raise NotImplementedError(f"OCR engine not implemented: {engine}")
    return results
def classify_page_type(ocr_results, page_num):
"""
Classify a document page by content type using keyword matching.
Returns one of: 'balance_sheet', 'income_statement',
'cash_flow', 'notes', 'audit', 'management', 'other'.
"""
page_text = " ".join(
[r["text"] for r in ocr_results if r["page"] == page_num]
).lower()
# Vietnamese financial statement keywords
keyword_map = {
"balance_sheet": [
"bảng cân đối kế toán", "tài sản", "nguồn vốn",
"nợ phải trả", "vốn chủ sở hữu"
],
"income_statement": [
"kết quả hoạt động kinh doanh", "doanh thu",
"lợi nhuận", "chi phí", "thu nhập"
],
"cash_flow": [
"lưu chuyển tiền tệ", "dòng tiền",
"hoạt động kinh doanh", "hoạt động đầu tư"
],
"audit": [
"báo cáo kiểm toán", "kiểm toán viên",
"ý kiến kiểm toán", "trung thực và hợp lý"
],
"management": [
"ban giám đốc", "hội đồng quản trị",
"báo cáo thường niên", "tình hình hoạt động"
]
}
scores = {}
for page_type, keywords in keyword_map.items():
scores[page_type] = sum(
1 for kw in keywords if kw in page_text
)
if max(scores.values()) == 0:
return "other"
return max(scores, key=scores.get)45.3.3 Table Extraction from Financial Statements
The highest-value extraction task is recovering structured tables from financial statements. We implement a two-stage approach: first detect table regions using a layout analysis model, then parse the detected regions into row-column structure.
def extract_tables_from_page(page_image, ocr_results, page_num):
"""
Extract structured tables from a document page.
Uses spatial clustering of OCR bounding boxes to identify
table regions, then aligns text into rows and columns.
Parameters
----------
page_image : PIL.Image
Page image.
ocr_results : list[dict]
OCR results for this page.
page_num : int
Page number.
Returns
-------
list[pd.DataFrame] : Extracted tables as DataFrames.
"""
page_texts = [r for r in ocr_results if r["page"] == page_num]
if not page_texts:
return []
# Extract bounding box centers
centers = []
for item in page_texts:
bbox = item["bbox"]
# bbox is [[x1,y1],[x2,y2],[x3,y3],[x4,y4]]
cx = np.mean([p[0] for p in bbox])
cy = np.mean([p[1] for p in bbox])
centers.append((cx, cy, item["text"]))
if not centers:
return []
centers_df = pd.DataFrame(centers, columns=["x", "y", "text"])
# Cluster into rows by y-coordinate proximity
centers_df = centers_df.sort_values("y")
row_threshold = 15 # pixels
centers_df["row_id"] = (
centers_df["y"].diff().abs() > row_threshold
).cumsum()
# Within each row, sort by x-coordinate
tables = []
rows = []
for row_id, row_group in centers_df.groupby("row_id"):
row_sorted = row_group.sort_values("x")
rows.append(row_sorted["text"].tolist())
if len(rows) > 2:
# Attempt to construct DataFrame
max_cols = max(len(r) for r in rows)
# Pad shorter rows
padded = [r + [""] * (max_cols - len(r)) for r in rows]
try:
df = pd.DataFrame(padded[1:], columns=padded[0])
tables.append(df)
except Exception:
tables.append(pd.DataFrame(padded))
return tables
def parse_financial_numbers(text):
"""
Parse Vietnamese financial number formats.
Vietnamese uses dots as thousands separators and commas as decimals.
E.g., '1.234.567' = 1234567, '1.234,56' = 1234.56
"""
import re
text = text.strip().replace(" ", "")
# Remove parentheses (negative indicator)
negative = text.startswith("(") and text.endswith(")")
text = text.strip("()")
# Handle Vietnamese number format
# If comma is present, it's a decimal separator
if "," in text:
text = text.replace(".", "").replace(",", ".")
else:
text = text.replace(".", "")
try:
value = float(text)
return -value if negative else value
except ValueError:
return np.nan45.3.4 Layout-Aware Document Understanding
Modern document AI goes beyond OCR by jointly modeling text content and spatial layout. LayoutLM (Huang et al. 2022) and its successors treat each token as having both a text embedding and a positional embedding derived from its bounding box coordinates. This allows the model to understand that a number positioned below a “Revenue” header and to the right of “2023” is the 2023 revenue figure, even without explicit table detection.
def layoutlm_extract(document_pages, model_name="layoutlmv3"):
"""
Extract structured financial data using LayoutLM.
This function uses the pre-trained LayoutLMv3 model for
document understanding with Vietnamese financial statements.
Parameters
----------
document_pages : list
List of (page_image, ocr_results) tuples.
model_name : str
Model variant.
Returns
-------
dict : Extracted financial fields.
"""
from transformers import (
LayoutLMv3ForTokenClassification,
LayoutLMv3Processor
)
processor = LayoutLMv3Processor.from_pretrained(
"microsoft/layoutlmv3-base",
apply_ocr=False # We provide our own OCR
)
model = LayoutLMv3ForTokenClassification.from_pretrained(
"microsoft/layoutlmv3-base",
num_labels=13 # Financial statement field types
)
# Define target fields for extraction
field_labels = [
"O", # Other
"B-REVENUE", "I-REVENUE",
"B-COGS", "I-COGS",
"B-NET_INCOME", "I-NET_INCOME",
"B-TOTAL_ASSETS", "I-TOTAL_ASSETS",
"B-TOTAL_EQUITY", "I-TOTAL_EQUITY",
"B-TOTAL_DEBT", "I-TOTAL_DEBT"
]
extracted = {}
for page_img, ocr_results in document_pages:
words = [r["text"] for r in ocr_results]
        boxes = []
        w_img, h_img = page_img.size
        for r in ocr_results:
            bbox = r["bbox"]
            x0 = min(p[0] for p in bbox)
            y0 = min(p[1] for p in bbox)
            x1 = max(p[0] for p in bbox)
            y1 = max(p[1] for p in bbox)
            # Normalize pixel coordinates to LayoutLM's 0-1000 range
            boxes.append([
                int(1000 * x0 / w_img), int(1000 * y0 / h_img),
                int(1000 * x1 / w_img), int(1000 * y1 / h_img)
            ])
if not words:
continue
# Process through LayoutLM
encoding = processor(
page_img,
words,
boxes=boxes,
return_tensors="pt",
truncation=True,
max_length=512
)
with torch.no_grad():
outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()
        # Extract labeled entities (simplified: assumes one token per word;
        # production code should map subword tokens back to source words)
        for idx, pred in enumerate(predictions):
            if pred > 0 and idx < len(words):
label = field_labels[pred]
if label.startswith("B-"):
field = label[2:]
value = parse_financial_numbers(words[idx])
if not np.isnan(value):
extracted[field] = value
    return extracted
45.4 Chart and Figure Digitization
45.4.1 Motivation: Unlocking Visual Financial Data
Financial charts (e.g., price time series, bar charts of earnings, scatter plots of risk-return tradeoffs) embed information that analysts process visually. For systematic strategies, this information must be converted to numerical form. Three use cases motivate chart digitization:
- Historical data recovery. Pre-digital financial data often exists only in printed charts. Digitizing these charts extends historical time series beyond the electronic era.
- Broker report extraction. Sell-side research reports contain charts with projections and scenario analyses. Extracting these programmatically enables systematic aggregation of analyst views.
- Regulatory filings. Vietnamese regulatory filings sometimes embed data as images (charts, scanned tables) rather than as machine-readable values.
45.4.2 Chart Type Classification
The first step is classifying the chart type (line, bar, scatter, pie, candlestick), which determines the appropriate digitization algorithm.
def build_chart_classifier(n_classes=5):
"""
Build a CNN-based chart type classifier.
Classes: line_chart, bar_chart, scatter_plot,
candlestick, pie_chart.
"""
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
# Replace final layer for chart classification
model.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(512, n_classes)
)
return model
def classify_chart(image_path, model, transform):
"""Classify a chart image into one of 5 types."""
class_names = [
"line_chart", "bar_chart", "scatter_plot",
"candlestick", "pie_chart"
]
img = Image.open(image_path).convert("RGB")
tensor = transform(img).unsqueeze(0)
with torch.no_grad():
logits = model(tensor)
probs = torch.softmax(logits, dim=1).squeeze()
pred_idx = probs.argmax().item()
return {
"predicted_class": class_names[pred_idx],
"confidence": probs[pred_idx].item(),
"all_probs": {
name: probs[i].item()
for i, name in enumerate(class_names)
}
}45.4.3 Line Chart Digitization
For line charts, the digitization task is to recover the \((x, y)\) data series from the image. The pipeline involves axis detection, scale calibration, and curve tracing.
def digitize_line_chart(image_path, x_range=None, y_range=None):
"""
Digitize a line chart image to recover the data series.
Parameters
----------
image_path : str
Path to chart image.
x_range : tuple, optional
(x_min, x_max) if known.
y_range : tuple, optional
(y_min, y_max) if known.
Returns
-------
DataFrame : Digitized data points (x, y).
"""
import cv2
img = cv2.imread(str(image_path))
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
h, w = gray.shape
# Step 1: Detect plot area (largest rectangular region)
edges = cv2.Canny(gray, 50, 150)
contours, _ = cv2.findContours(
edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
)
if contours:
largest = max(contours, key=cv2.contourArea)
x_start, y_start, plot_w, plot_h = cv2.boundingRect(largest)
else:
# Fallback: assume plot is central 80% of image
x_start, y_start = int(w * 0.1), int(h * 0.1)
plot_w, plot_h = int(w * 0.8), int(h * 0.8)
# Step 2: Extract line pixels within plot area
# Convert to HSV and isolate colored lines
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
plot_region = hsv[y_start:y_start + plot_h,
x_start:x_start + plot_w]
# Detect non-white, non-gray pixels (likely the line)
saturation = plot_region[:, :, 1]
line_mask = saturation > 30 # Colored pixels
# Step 3: Trace the line (column-wise median of colored pixels)
data_points = []
for col in range(plot_w):
col_pixels = np.where(line_mask[:, col])[0]
if len(col_pixels) > 0:
# Use median y-position
y_pixel = np.median(col_pixels)
# Convert pixel to data coordinates
x_frac = col / plot_w
y_frac = 1 - y_pixel / plot_h # Invert y-axis
x_val = (x_range[0] + x_frac * (x_range[1] - x_range[0])
if x_range else x_frac)
y_val = (y_range[0] + y_frac * (y_range[1] - y_range[0])
if y_range else y_frac)
data_points.append({"x": x_val, "y": y_val})
return pd.DataFrame(data_points)
45.5 Visual Sentiment Analysis
45.5.1 Image Sentiment in Financial News
News articles are accompanied by images that carry sentiment independent of the text. A photograph of a CEO smiling at a press conference conveys different information than the same CEO facing protesters. Obaid and Pukthuanthong (2022) demonstrate that the visual sentiment of Wall Street Journal photographs predicts market returns: days with more negative imagery precede lower returns.
We implement visual sentiment analysis using two approaches: a pre-trained sentiment classifier and a vision-language model that interprets images in financial context.
45.5.2 CNN-Based Visual Sentiment
def compute_visual_sentiment(image_paths, model_name="resnet50"):
"""
Compute visual sentiment scores using a fine-tuned CNN.
Uses features from a pre-trained CNN followed by a sentiment
classifier trained on the Visual Sentiment Ontology (VSO)
or similar dataset.
Parameters
----------
image_paths : list
Paths to news images.
Returns
-------
DataFrame : image_path, positive_score, negative_score, sentiment.
"""
device = "cuda" if torch.cuda.is_available() else "cpu"
model, transform, dim = build_feature_extractor(model_name, device)
# Extract features
features = extract_features(image_paths, model, transform, device)
# Placeholder sentiment model: mean activation over fixed channel
# blocks as a proxy. In practice, fine-tune a classifier on labeled
# financial images (e.g., VSO) and select positive/negative channels
# via validation; the thirds-based split below is purely illustrative.
pos_channels = list(range(0, dim // 3))
neg_channels = list(range(dim // 3, 2 * dim // 3))
pos_scores = features[:, pos_channels].mean(axis=1)
neg_scores = features[:, neg_channels].mean(axis=1)
# Normalize to [0, 1]
pos_norm = (pos_scores - pos_scores.min()) / (
pos_scores.max() - pos_scores.min() + 1e-8
)
neg_norm = (neg_scores - neg_scores.min()) / (
neg_scores.max() - neg_scores.min() + 1e-8
)
sentiment = pos_norm - neg_norm
return pd.DataFrame({
"image_path": image_paths,
"positive_score": pos_norm,
"negative_score": neg_norm,
"net_sentiment": sentiment
})
45.5.3 Vision-Language Models for Financial Image Understanding
The most powerful approach to financial image analysis uses vision-language models (VLMs), which jointly process images and text. Models such as CLIP (Radford et al. 2021), BLIP-2 (Li et al. 2023), and GPT-4V can be prompted to interpret financial images in context. For instance, given an aerial photograph of a factory, a VLM can answer “Is this factory operating at full capacity?” or “Is there visible construction of additional facilities?”
def vlm_financial_analysis(image_path, prompt, model_name="clip"):
"""
Use a vision-language model to analyze a financial image.
Parameters
----------
image_path : str
Path to image.
prompt : str
Financial analysis prompt.
model_name : str
'clip' for zero-shot classification,
'blip2' for visual question answering.
Returns
-------
dict : Model output (scores or text).
"""
img = Image.open(image_path).convert("RGB")
if model_name == "clip":
from transformers import CLIPProcessor, CLIPModel
clip_model = CLIPModel.from_pretrained(
"openai/clip-vit-base-patch32"
)
processor = CLIPProcessor.from_pretrained(
"openai/clip-vit-base-patch32"
)
# Zero-shot classification against fixed financial labels
# (the `prompt` argument is not used in this branch)
labels = [
"busy commercial area with many customers",
"empty commercial area with few customers",
"active construction site with workers",
"idle construction site without activity",
"healthy green crops in agricultural field",
"damaged or dry crops in agricultural field",
"busy port with many ships and containers",
"quiet port with few ships"
]
inputs = processor(
text=labels,
images=img,
return_tensors="pt",
padding=True
)
with torch.no_grad():
outputs = clip_model(**inputs)
logits = outputs.logits_per_image.squeeze()
probs = torch.softmax(logits, dim=0)
results = {
label: prob.item()
for label, prob in zip(labels, probs)
}
return {"scores": results, "top_label": max(results, key=results.get)}
elif model_name == "blip2":
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained(
"Salesforce/blip2-opt-2.7b"
)
model = Blip2ForConditionalGeneration.from_pretrained(
"Salesforce/blip2-opt-2.7b",
torch_dtype=torch.float16
)
inputs = processor(images=img, text=prompt, return_tensors="pt")
# Cast floating-point inputs (pixel values) to match fp16 weights
inputs = {k: v.to(torch.float16) if v.dtype == torch.float32 else v
for k, v in inputs.items()}
with torch.no_grad():
generated_ids = model.generate(**inputs, max_new_tokens=100)
answer = processor.decode(
generated_ids[0], skip_special_tokens=True
)
return {"answer": answer}
# Construct daily visual sentiment index from news images
# Source: Vietnamese financial news sites (VnExpress, CafeF, etc.)
news_images = dc.get_news_images(
start_date="2018-01-01",
end_date="2024-12-31",
source=["vnexpress_finance", "cafef"]
)
# Aggregate daily visual sentiment (assumes each image row already
# carries a net_sentiment score, e.g., from compute_visual_sentiment)
daily_sentiment = (
news_images.groupby("date")
.agg(
visual_sentiment=("net_sentiment", "mean"),
n_images=("net_sentiment", "count"),
pct_negative=("net_sentiment", lambda x: (x < 0).mean())
)
.reset_index()
)
# Merge with market returns
market_returns = dc.get_market_returns(
start_date="2018-01-01",
end_date="2024-12-31",
frequency="daily"
)
sentiment_returns = daily_sentiment.merge(
market_returns[["date", "mkt_ret"]],
on="date",
how="inner"
)
# Lead-lag analysis: does visual sentiment predict next-day returns?
sentiment_returns = sentiment_returns.sort_values("date")
sentiment_returns["mkt_ret_lead1"] = sentiment_returns["mkt_ret"].shift(-1)
# Regression: next-day return on today's visual sentiment
sr_clean = sentiment_returns.dropna(
subset=["mkt_ret_lead1", "visual_sentiment", "mkt_ret"]
)
model_sent = sm.OLS(
sr_clean["mkt_ret_lead1"],
sm.add_constant(sr_clean[["visual_sentiment", "mkt_ret"]])
).fit(cov_type="HAC", cov_kwds={"maxlags": 5})
sent_results = pd.DataFrame({
"Coefficient": model_sent.params.round(6),
"Std Error": model_sent.bse.round(6),
"t-stat": model_sent.tvalues.round(3),
"p-value": model_sent.pvalues.round(4)
})
sent_results
45.6 Multimodal Fusion: Combining Image and Text
45.6.1 Why Multimodal?
Text and images capture different dimensions of the same underlying economic reality. An earnings report describes financial performance in words and numbers; the accompanying photographs show factories, products, and management. A news article about a port describes trade volumes in text; the satellite image shows actual ship positions. Combining both modalities yields a richer representation than either alone.
The fusion architecture depends on the application:
Early fusion. Concatenate image features \(\mathbf{z}^{\text{img}}\) and text features \(\mathbf{z}^{\text{txt}}\) into a single vector \([\mathbf{z}^{\text{img}}; \mathbf{z}^{\text{txt}}]\) before prediction. Simple but ignores cross-modal interactions.
Late fusion. Train separate models on each modality and combine predictions: \(\hat{y} = \alpha \hat{y}^{\text{img}} + (1-\alpha) \hat{y}^{\text{txt}}\). Robust but cannot learn cross-modal features.
Cross-attention fusion. Use transformer cross-attention to let each modality attend to the other. Most powerful but requires more data and computation.
\[ \mathbf{z}^{\text{fused}} = \text{CrossAttention}(\mathbf{z}^{\text{img}}, \mathbf{z}^{\text{txt}}) = \text{softmax}\left(\frac{\mathbf{Q}^{\text{img}} (\mathbf{K}^{\text{txt}})^\top}{\sqrt{d}}\right) \mathbf{V}^{\text{txt}} \tag{45.3}\]
class MultimodalFusionModel(nn.Module):
"""
Multimodal fusion model combining image and text features
for financial prediction.
Supports early fusion, late fusion, and cross-attention.
"""
def __init__(self, img_dim=2048, txt_dim=768, hidden_dim=256,
fusion="early", n_heads=4):
super().__init__()
self.fusion = fusion
# Image projection
self.img_proj = nn.Sequential(
nn.Linear(img_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2)
)
# Text projection
self.txt_proj = nn.Sequential(
nn.Linear(txt_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2)
)
if fusion == "early":
self.head = nn.Sequential(
nn.Linear(hidden_dim * 2, hidden_dim),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(hidden_dim, 1)
)
elif fusion == "late":
self.img_head = nn.Linear(hidden_dim, 1)
self.txt_head = nn.Linear(hidden_dim, 1)
self.alpha = nn.Parameter(torch.tensor(0.5))
elif fusion == "cross_attention":
self.cross_attn = nn.MultiheadAttention(
embed_dim=hidden_dim,
num_heads=n_heads,
batch_first=True
)
self.head = nn.Sequential(
nn.Linear(hidden_dim, hidden_dim // 2),
nn.ReLU(),
nn.Linear(hidden_dim // 2, 1)
)
def forward(self, img_features, txt_features):
img_h = self.img_proj(img_features)
txt_h = self.txt_proj(txt_features)
if self.fusion == "early":
combined = torch.cat([img_h, txt_h], dim=-1)
return self.head(combined).squeeze(-1)
elif self.fusion == "late":
img_pred = self.img_head(img_h).squeeze(-1)
txt_pred = self.txt_head(txt_h).squeeze(-1)
alpha = torch.sigmoid(self.alpha)
return alpha * img_pred + (1 - alpha) * txt_pred
elif self.fusion == "cross_attention":
# Image attends to text
img_h_unsq = img_h.unsqueeze(1) # (B, 1, D)
txt_h_unsq = txt_h.unsqueeze(1)
attn_out, _ = self.cross_attn(
img_h_unsq, txt_h_unsq, txt_h_unsq
)
return self.head(attn_out.squeeze(1)).squeeze(-1)
def run_multimodal_experiment(image_features, text_features, returns,
fusion_types=["early", "late",
"cross_attention"]):
"""
Compare multimodal fusion strategies for return prediction.
Parameters
----------
image_features : np.ndarray
Image feature matrix (N x img_dim).
text_features : np.ndarray
Text feature matrix (N x txt_dim).
returns : np.ndarray
Target returns (N,).
fusion_types : list
Fusion strategies to compare.
Returns
-------
DataFrame : R², MSE, Sharpe for each strategy.
"""
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
results = []
for fusion in fusion_types:
fold_r2s = []
for train_idx, test_idx in tscv.split(returns):
# Convert to tensors
X_img_train = torch.tensor(
image_features[train_idx], dtype=torch.float32
)
X_txt_train = torch.tensor(
text_features[train_idx], dtype=torch.float32
)
y_train = torch.tensor(
returns[train_idx], dtype=torch.float32
)
X_img_test = torch.tensor(
image_features[test_idx], dtype=torch.float32
)
X_txt_test = torch.tensor(
text_features[test_idx], dtype=torch.float32
)
y_test = returns[test_idx]
# Build and train model
model = MultimodalFusionModel(
img_dim=image_features.shape[1],
txt_dim=text_features.shape[1],
fusion=fusion
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
model.train()
for epoch in range(50):
optimizer.zero_grad()
pred = model(X_img_train, X_txt_train)
loss = loss_fn(pred, y_train)
loss.backward()
optimizer.step()
# Evaluate
model.eval()
with torch.no_grad():
y_pred = model(X_img_test, X_txt_test).numpy()
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0
fold_r2s.append(r2)
results.append({
"fusion": fusion,
"r2_mean": np.mean(fold_r2s),
"r2_std": np.std(fold_r2s)
})
# Add unimodal baselines
from sklearn.linear_model import RidgeCV
for modality, features in [("image_only", image_features),
("text_only", text_features)]:
fold_r2s = []
for train_idx, test_idx in tscv.split(returns):
ridge = RidgeCV(alphas=np.logspace(-3, 3, 10))
ridge.fit(features[train_idx], returns[train_idx])
y_pred = ridge.predict(features[test_idx])
y_test = returns[test_idx]
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
fold_r2s.append(1 - ss_res / ss_tot if ss_tot > 0 else 0)
results.append({
"fusion": modality,
"r2_mean": np.mean(fold_r2s),
"r2_std": np.std(fold_r2s)
})
return pd.DataFrame(results)
45.6.2 Practical Considerations for Vietnamese Markets
Multimodal analysis in Vietnamese markets presents several practical challenges:
Data alignment. Satellite images, news articles, and market data operate on different temporal frequencies and spatial resolutions. Satellite composites are available weekly or biweekly; news is daily; trading is intraday. Proper alignment requires specifying the information set available to an investor at the time of the trading decision to avoid look-ahead bias.
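The alignment step can be sketched with `pandas.merge_asof`, which matches each trading date to the most recent satellite composite released on or before it; the dates and column names below are illustrative:

```python
import pandas as pd

# Illustrative weekly satellite composites with known release dates
satellite = pd.DataFrame({
    "release_date": pd.to_datetime(
        ["2024-01-05", "2024-01-12", "2024-01-19"]),
    "luminosity": [0.52, 0.55, 0.61],
})

# Daily trading dates
trading = pd.DataFrame({"date": pd.bdate_range("2024-01-08", "2024-01-22")})

# Backward match: each trading day sees only composites already
# released, which avoids look-ahead bias by construction
aligned = pd.merge_asof(
    trading.sort_values("date"),
    satellite.sort_values("release_date"),
    left_on="date",
    right_on="release_date",
    direction="backward",
)
print(aligned[["date", "release_date", "luminosity"]].head())
```

The `direction="backward"` choice is the crucial design decision: a `"nearest"` match would let a trading day see a composite released after it.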
Label scarcity. Supervised learning requires labeled data (e.g., images annotated with economic outcomes). In Vietnam, ground-truth labels (actual retail sales, actual crop yields, actual port throughput) arrive with significant lags and often lack the granularity to match satellite resolution. Semi-supervised and self-supervised approaches are therefore essential.
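One simple semi-supervised pattern is pseudo-labeling: train on the few labeled tiles, then add unlabeled tiles whose predicted class probability exceeds a confidence threshold. A minimal sketch with scikit-learn, using synthetic stand-in features and an illustrative 0.9 threshold (both assumptions, not values from this chapter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for image features: 40 labeled, 200 unlabeled tiles
X_lab = rng.normal(size=(40, 16))
y_lab = (X_lab[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)
X_unl = rng.normal(size=(200, 16))

# Step 1: fit on the scarce labels
clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Step 2: pseudo-label confident unlabeled tiles
# (the threshold is a tuning choice)
proba = clf.predict_proba(X_unl).max(axis=1)
confident = proba > 0.9
X_aug = np.vstack([X_lab, X_unl[confident]])
y_aug = np.concatenate([y_lab, clf.predict(X_unl[confident])])

# Step 3: refit on the augmented set
clf_aug = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
print(f"Added {confident.sum()} pseudo-labeled tiles")
```

In practice the loop would iterate (refit, re-score, re-threshold) and the threshold would be validated against whatever lagged ground truth eventually arrives.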
Regulatory considerations. High-resolution satellite imagery of specific commercial or military installations may be restricted. Researchers should verify that their imagery sources comply with Vietnamese regulations on geospatial data.
Computational cost. Processing satellite tiles through CNNs is computationally intensive. A single Sentinel-2 tile at 10m resolution covering Ho Chi Minh City contains approximately \(10{,}980 \times 10{,}980\) pixels per band. Tiling into \(224 \times 224\) patches for CNN input generates \(\sim 2{,}400\) patches per tile, each requiring a forward pass through the network.
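The patch count quoted above follows from simple tiling arithmetic. Assuming non-overlapping 224-pixel patches (a choice, not a requirement; overlapping strides raise the count further):

```python
def count_patches(tile_px=10980, patch_px=224, stride=None):
    """Patches per tile side and per tile for a square image."""
    stride = stride or patch_px  # non-overlapping by default
    per_side = (tile_px - patch_px) // stride + 1
    return per_side, per_side ** 2

per_side, total = count_patches()
print(per_side, total)  # 49 patches per side, 2401 per tile
```

Each of those roughly 2,400 patches requires a forward pass, so batching and GPU inference are effectively mandatory at city scale.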
| Application | Image Source | Resolution | Frequency | Vietnamese Availability |
|---|---|---|---|---|
| Nighttime luminosity | VIIRS/DMSP | 500m | Monthly | Free (NOAA/EOG) |
| Crop health | MODIS/Sentinel-2 | 250m/10m | 16-day/5-day | Free (NASA/ESA) |
| Port/ship detection | Sentinel-1 (SAR) | 10m | 12-day | Free (ESA Copernicus) |
| Construction monitoring | Commercial (Maxar) | 30cm | On demand | Paid ($) |
| Urban density | Sentinel-2 | 10m | 5-day | Free (ESA) |
| Document OCR | Corporate filings | N/A | Event-driven | DataCore.vn |
| News images | Financial media | N/A | Daily | Web scraping |
45.7 Summary
This chapter extended the alternative data toolkit from text (the previous chapter) to images. We demonstrated five distinct application domains for visual data in Vietnamese financial markets.
First, satellite and geospatial imagery provides high-frequency, spatially granular economic signals that lead official statistics. Nighttime luminosity serves as a provincial GDP proxy with cross-sectional \(R^2\) exceeding 0.7; NDVI crop health indices predict agricultural firm returns; and CNN features extracted from satellite tiles enable rich spatial representations of economic activity.
Second, document image analysis solves the practical problem of extracting structured data from Vietnamese financial filings that arrive as scanned images. The pipeline (e.g., OCR with Vietnamese-optimized engines, layout analysis, table extraction, and LayoutLM-based document understanding) converts unstructured pixels into the structured financial data that all downstream analyses require.
Third, chart digitization recovers numerical data series from visual representations, extending historical coverage and enabling systematic consumption of analyst outputs. Fourth, visual sentiment analysis from news imagery provides a signal dimension orthogonal to textual sentiment, with potential predictive power for market returns.
Fifth, multimodal fusion (combining image and text representations via early, late, or cross-attention architectures) yields richer predictive models than either modality alone. The practical benefit of multimodal approaches scales with the diversity and quality of available data, making it increasingly relevant as Vietnamese alternative data ecosystems mature.
The common thread across all applications is the transformation pipeline: raw pixel tensor \(\to\) feature representation (via CNN, ViT, or VLM) \(\to\) financial signal \(\to\) economic interpretation. The choice of architecture and the quality of the domain adaptation determine whether the resulting signal has genuine predictive content or merely captures noise.