45  Image and Visual Data in Finance

The previous chapter demonstrated how unstructured text (e.g., earnings reports, news articles, business descriptions) can be transformed into structured signals for financial analysis. This chapter extends the alternative data toolkit to a second modality: images. Visual data is abundant in financial contexts yet systematically underexploited. Satellite photographs reveal real economic activity (e.g., parking lot occupancy at retail locations, construction progress at industrial sites, nighttime luminosity as a proxy for regional GDP, ship traffic at port terminals, crop health across agricultural zones). Corporate documents arrive as scanned PDFs whose tables and figures resist standard text extraction. Financial charts encode information that analysts interpret visually but that systematic strategies cannot consume without digitization. And the visual content of social media, advertising, and product imagery carries sentiment and brand signals that complement textual analysis.

The core challenge is representational: an image is a three-dimensional tensor of pixel intensities with no inherent semantic structure. Converting this raw array into a financial signal (i.e., a number that predicts returns, measures risk, or proxies for economic activity) requires either hand-crafted feature engineering or learned representations via deep convolutional neural networks (CNNs) and vision transformers (ViTs). This chapter covers both approaches.

We organize the material around five application domains, each with distinct data sources, modeling requirements, and economic motivations. First, satellite and geospatial imagery for nowcasting economic activity. Second, document image analysis for extracting structured data from Vietnamese financial filings. Third, chart and figure digitization for systematic backtesting. Fourth, visual sentiment analysis from social and news media. Fifth, multimodal fusion, combining image and text signals into joint predictive models.

Vietnamese markets present particular opportunities in this space. Satellite imagery is especially informative in an economy with large agricultural and manufacturing sectors where ground-truth data arrives with significant lags. Vietnamese financial filings are often distributed as scanned images rather than machine-readable formats, making document AI essential rather than optional. And the rapid urbanization visible in construction and infrastructure imagery provides high-frequency proxies for macroeconomic momentum that official statistics cannot match.

import pandas as pd
import numpy as np
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

# Core image processing
from PIL import Image
import io

# Deep learning
import torch
import torch.nn as nn
import torchvision.transforms as transforms
import torchvision.models as models

# Visualization
import plotnine as p9
from mizani.formatters import percent_format, comma_format
import matplotlib.pyplot as plt

# Statistical analysis
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.panel import PanelOLS

45.1 Foundations: From Pixels to Financial Signals

45.1.1 Image Representation

A digital image is a function \(I: \{1, \ldots, H\} \times \{1, \ldots, W\} \times \{1, \ldots, C\} \rightarrow [0, 255]\) mapping spatial coordinates and color channels to intensity values. For an RGB image of height \(H\) and width \(W\), the representation is a tensor \(\mathbf{I} \in \mathbb{R}^{H \times W \times 3}\). A single \(224 \times 224\) RGB image (i.e., the standard input for modern CNNs) contains \(224 \times 224 \times 3 = 150{,}528\) dimensions. This extreme dimensionality, combined with spatial structure (nearby pixels are correlated), makes images fundamentally different from tabular financial data and demands specialized architectures.

The key insight of convolutional neural networks is parameter sharing via local filters. A convolutional layer applies a kernel \(\mathbf{K} \in \mathbb{R}^{k \times k \times C_{\text{in}}}\) to produce a feature map:

\[ (\mathbf{I} * \mathbf{K})(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} \sum_{c=1}^{C_{\text{in}}} I(i+m, j+n, c) \cdot K(m, n, c) \tag{45.1}\]
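Equation 45.1 is the cross-correlation form that deep learning libraries compute (no kernel flip). A minimal numeric sketch for a single channel, using toy arrays and "valid" padding:

```python
import numpy as np

def conv2d_valid(I, K):
    """Direct implementation of Eq. 45.1 for a single channel."""
    H, W = I.shape
    k = K.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum of elementwise products over the k x k window
            out[i, j] = np.sum(I[i:i + k, j:j + k] * K)
    return out

I = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
K = np.array([[1., 0.],
              [0., -1.]])  # a simple diagonal difference filter

# Every 2x2 window gives I[i,j] - I[i+1,j+1] = -4
print(conv2d_valid(I, K))
# [[-4. -4.]
#  [-4. -4.]]
```

A real convolutional layer applies many such kernels in parallel and sums over input channels, but the inner computation is exactly this windowed dot product.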

By stacking convolutional layers with nonlinearities and pooling operations, the network builds a hierarchy of representations: early layers detect edges and textures; middle layers detect parts and patterns; deep layers detect objects and scenes. The final layer output \(\mathbf{z} \in \mathbb{R}^{d}\) (with \(d\) typically 512-2048) is a compact representation of the image’s semantic content, which can be used directly as a feature vector for financial prediction.

Financial images span a wide range of resolutions and modalities:

Table 45.1: Image Data Sources for Vietnamese Financial Markets
| Source | Resolution | Channels | Typical Size | Update Frequency |
|---|---|---|---|---|
| Sentinel-2 satellite | 10m/pixel | 13 bands | 10,980 × 10,980 | 5 days |
| Planet Labs | 3m/pixel | 4 bands | 4,000 × 4,000 | Daily |
| VIIRS nightlights | 500m/pixel | 1 (DNB) | 3,000 × 1,800 | Monthly composite |
| Annual report scan | 300 DPI | 3 (RGB) | 2,480 × 3,508 | Annual |
| CEO photograph | Varies | 3 (RGB) | 500 × 500 | Annual |
| News photograph | Varies | 3 (RGB) | 800 × 600 | Real-time |
| Financial chart | Varies | 3 (RGB) | 1,000 × 600 | Real-time |

45.1.2 Transfer Learning for Finance

Training a CNN from scratch requires millions of labeled images, which is far more than any financial application can provide. Transfer learning solves this by using networks pre-trained on ImageNet (1.2 million images, 1,000 classes) as feature extractors. The pre-trained network has already learned generic visual representations (edges, textures, shapes, objects); we simply replace the final classification layer with a task-specific head.

Formally, let \(f_{\boldsymbol{\theta}}(\mathbf{I})\) denote a pre-trained network with parameters \(\boldsymbol{\theta}\) partitioned into feature extractor \(\boldsymbol{\theta}_{\text{feat}}\) and classifier \(\boldsymbol{\theta}_{\text{cls}}\). For financial applications, we:

  1. Feature extraction: Freeze \(\boldsymbol{\theta}_{\text{feat}}\), extract \(\mathbf{z} = f_{\boldsymbol{\theta}_{\text{feat}}}(\mathbf{I})\), and train a simple model (linear regression, gradient boosting) on \(\mathbf{z}\).
  2. Fine-tuning: Initialize from \(\boldsymbol{\theta}\) and train all parameters on the financial task with a small learning rate to avoid catastrophic forgetting.

The key architectural families are:

ResNet [He et al. (2016)]. Residual connections (\(y = F(x) + x\)) enable training of very deep networks (50-152 layers). The skip connection solves the vanishing gradient problem. ResNet-50 produces a 2,048-dimensional feature vector from the penultimate layer.
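The residual pattern in code, as a minimal sketch (real ResNet blocks also include batch normalization and a projection shortcut when shapes change):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = relu(F(x) + x); the skip connection gives gradients
    a direct path through the network."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # skip connection: y = F(x) + x

block = ResidualBlock(64)
x = torch.randn(1, 64, 56, 56)
y = block(x)  # same shape as x: the block preserves spatial dimensions
```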

EfficientNet [Tan and Le (2019)]. Compound scaling of depth, width, and resolution simultaneously. EfficientNet-B0 achieves ResNet-50 accuracy with 5.3M parameters (vs. 25.6M), making it practical for processing thousands of satellite tiles.

Vision Transformer (ViT) [Dosovitskiy et al. (2020)]. Treats an image as a sequence of \(16 \times 16\) patches and processes them through a standard Transformer encoder. ViT-B/16 produces a 768-dimensional embedding. Particularly effective for document images where spatial relationships between elements (tables, headers, text blocks) matter.

# DataCore.vn API
from datacore import DataCore
dc = DataCore()

def build_feature_extractor(model_name="resnet50", device="cpu"):
    """
    Build a pre-trained CNN feature extractor.

    Parameters
    ----------
    model_name : str
        One of 'resnet50', 'efficientnet_b0', 'vit_b_16'.
    device : str
        'cpu' or 'cuda'.

    Returns
    -------
    model : nn.Module
        Feature extraction model.
    transform : transforms.Compose
        Image preprocessing pipeline.
    dim : int
        Dimensionality of the extracted feature vector.
    """
    if model_name == "resnet50":
        weights = models.ResNet50_Weights.IMAGENET1K_V2
        model = models.resnet50(weights=weights)
        model.fc = nn.Identity()  # Remove classification head
        dim = 2048
    elif model_name == "efficientnet_b0":
        weights = models.EfficientNet_B0_Weights.IMAGENET1K_V1
        model = models.efficientnet_b0(weights=weights)
        model.classifier = nn.Identity()
        dim = 1280
    elif model_name == "vit_b_16":
        weights = models.ViT_B_16_Weights.IMAGENET1K_V1
        model = models.vit_b_16(weights=weights)
        model.heads = nn.Identity()
        dim = 768
    else:
        raise ValueError(f"Unknown model_name: {model_name}")

    model = model.to(device).eval()

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )
    ])

    return model, transform, dim


def extract_features(image_paths, model, transform, device="cpu",
                     batch_size=32):
    """
    Extract deep features from a list of images.

    Parameters
    ----------
    image_paths : list
        Paths to image files.
    model : nn.Module
        Feature extraction model.
    transform : transforms.Compose
        Preprocessing pipeline.
    device : str
        'cpu' or 'cuda'.
    batch_size : int
        Number of images per forward pass.

    Returns
    -------
    np.ndarray : Feature matrix (n_images x feature_dim).
    """
    features = []

    for i in range(0, len(image_paths), batch_size):
        batch_paths = image_paths[i:i + batch_size]
        batch_tensors = []

        for path in batch_paths:
            try:
                img = Image.open(path).convert("RGB")
                tensor = transform(img)
                batch_tensors.append(tensor)
            except Exception:
                # Unreadable file: substitute a blank tensor so the
                # batch keeps its shape
                batch_tensors.append(torch.zeros(3, 224, 224))

        batch = torch.stack(batch_tensors).to(device)

        with torch.no_grad():
            batch_features = model(batch).cpu().numpy()

        features.append(batch_features)

    return np.vstack(features)

45.2 Satellite and Geospatial Imagery

45.2.1 Economic Activity from Space

Satellite imagery provides high-frequency, spatially granular measurements of economic activity that are independent of and often lead official statistics. The foundational work of Henderson, Storeygard, and Weil (2012) demonstrated that nighttime luminosity, measured by the Defense Meteorological Satellite Program (DMSP), is a reliable proxy for GDP, particularly in countries where official statistics are noisy or delayed. Donaldson and Storeygard (2016) survey the rapidly growing range of satellite-data applications in economics. Jean et al. (2016) combine daytime satellite imagery with CNNs to predict poverty from space with \(R^2 > 0.7\).

For financial applications, the key insight is that satellite data arrives faster than corporate earnings or government statistics. A retailer’s quarterly revenue is reported 4-8 weeks after the quarter ends; satellite imagery of its parking lots is available within days. This temporal advantage creates a natural use case for nowcasting (i.e., estimating current economic conditions before official data arrives) and for constructing trading signals based on information that is public but costly to process.

45.2.2 Application 1: Nighttime Luminosity and Provincial GDP

Vietnam’s General Statistics Office (GSO) publishes provincial GDP with a lag of several months. Nighttime luminosity from the VIIRS (Visible Infrared Imaging Radiometer Suite) sensor provides a near-real-time alternative. We construct a firm-level exposure measure by linking each listed firm’s registered location to the luminosity of its province.

# Load nighttime luminosity data (VIIRS monthly composites)
# Source: Earth Observation Group (EOG) / NOAA
nightlights = dc.get_nightlight_data(
    start_date="2014-01-01",
    end_date="2024-12-31",
    resolution="province"
)

# Load firm location data
firm_locations = dc.get_firm_locations()

# Provincial GDP from GSO
provincial_gdp = dc.get_provincial_gdp(
    start_date="2014-01-01",
    end_date="2024-12-31"
)

print(f"Nightlight observations: {len(nightlights)}")
print(f"Provinces covered: {nightlights['province'].nunique()}")
print(f"Firms with location: {firm_locations['ticker'].nunique()}")
# Validate: does nightlight predict provincial GDP?
nl_gdp = nightlights.merge(
    provincial_gdp,
    on=["province", "year", "quarter"],
    how="inner"
)

# Log-log specification (standard in the literature)
nl_gdp["ln_luminosity"] = np.log(nl_gdp["mean_radiance"].clip(lower=0.01))
nl_gdp["ln_gdp"] = np.log(nl_gdp["provincial_gdp"].clip(lower=1))

# Cross-sectional regression by year
validation_results = []
for year in nl_gdp["year"].unique():
    subset = nl_gdp[nl_gdp["year"] == year]
    if len(subset) < 20:
        continue

    model = sm.OLS(
        subset["ln_gdp"],
        sm.add_constant(subset["ln_luminosity"])
    ).fit()

    validation_results.append({
        "year": year,
        "beta": model.params.iloc[1],
        "r_squared": model.rsquared,
        "n_provinces": int(model.nobs)
    })

validation_df = pd.DataFrame(validation_results)
print(f"Avg R² (ln GDP ~ ln Luminosity): {validation_df['r_squared'].mean():.3f}")
(
    p9.ggplot(nl_gdp[nl_gdp["year"] == 2023],
              p9.aes(x="ln_luminosity", y="ln_gdp"))
    + p9.geom_point(color="#2E5090", alpha=0.6, size=2)
    + p9.geom_smooth(method="lm", color="#C0392B", se=True, size=0.8)
    + p9.labs(
        x="ln(Mean Nighttime Radiance)",
        y="ln(Provincial GDP)",
        title="Nighttime Luminosity vs. Provincial GDP (2023)"
    )
    + p9.theme_minimal()
    + p9.theme(figure_size=(10, 6))
)
Figure 45.1: Nighttime Luminosity vs. Provincial GDP (2023)
# Construct firm-level nightlight signal
# Logic: firms in provinces with accelerating luminosity
# are experiencing positive local economic conditions

nl_growth = nightlights.copy().sort_values(["province", "year", "quarter"])

# Year-over-year luminosity growth by province
nl_growth["ln_radiance"] = np.log(nl_growth["mean_radiance"].clip(lower=0.01))
nl_growth["ln_radiance_lag4"] = nl_growth.groupby(
    "province"
)["ln_radiance"].shift(4)

nl_growth["nl_growth"] = nl_growth["ln_radiance"] - nl_growth["ln_radiance_lag4"]

# Merge with firms via province
firm_nl = firm_locations[["ticker", "province"]].merge(
    nl_growth[["province", "year", "quarter", "nl_growth"]],
    on="province",
    how="inner"
)

# Merge with stock returns
monthly_returns = dc.get_monthly_returns(
    start_date="2014-01-01",
    end_date="2024-12-31"
)
monthly_returns["year"] = monthly_returns["date"].dt.year
monthly_returns["quarter"] = monthly_returns["date"].dt.quarter

returns_with_nl = monthly_returns.merge(
    firm_nl,
    on=["ticker", "year", "quarter"],
    how="inner"
)
# Portfolio sort: quintiles on provincial nightlight growth
# (labels=False avoids a label/bin mismatch if qcut drops duplicate edges)
returns_with_nl["nl_quintile"] = returns_with_nl.groupby("date")[
    "nl_growth"
].transform(lambda x: pd.qcut(x, 5, labels=False, duplicates="drop") + 1)

nl_port_returns = (
    returns_with_nl.groupby(["date", "nl_quintile"])
    .agg(port_ret=("ret", "mean"))
    .reset_index()
)

nl_wide = nl_port_returns.pivot(
    index="date", columns="nl_quintile", values="port_ret"
)
nl_wide["L-S"] = nl_wide[5] - nl_wide[1]  # High NL growth - Low
Table 45.2: Nighttime Luminosity Growth Quintile Portfolio Returns
nl_summary = nl_wide.describe().T[["mean", "std"]].copy()
nl_summary["mean_ann"] = nl_summary["mean"] * 12
nl_summary["sharpe"] = (
    nl_summary["mean_ann"] / (nl_summary["std"] * np.sqrt(12))
)
for col in nl_wide.columns:
    t_stat = nl_wide[col].mean() / (
        nl_wide[col].std() / np.sqrt(len(nl_wide.dropna()))
    )
    nl_summary.loc[col, "t_stat"] = t_stat

nl_summary = nl_summary[["mean_ann", "sharpe", "t_stat"]].round(4)
nl_summary.columns = ["Ann. Return", "Sharpe", "t-stat"]
nl_summary

45.2.3 Application 2: Satellite Imagery for Sector Nowcasting

Beyond luminosity, daytime satellite imagery provides sector-specific signals. We implement three channels relevant to the Vietnamese economy.

Port activity. Vietnam is a major export-oriented economy. Satellite imagery of container ports (Cát Lái, Hải Phòng) captures trade throughput before customs statistics are released. Ship detection algorithms applied to synthetic aperture radar (SAR) imagery count vessels and estimate cargo volumes.

Construction progress. Real estate and construction constitute a significant fraction of Vietnamese GDP and market capitalization. Change detection algorithms applied to high-resolution optical imagery identify construction starts, completion rates, and land-use conversion.
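The simplest change-detection primitive differences two co-registered tiles and flags pixels whose change exceeds a threshold. A sketch with synthetic arrays standing in for real imagery (production systems add radiometric normalization and learned change classifiers):

```python
import numpy as np

def change_mask(img_t0, img_t1, threshold=0.2):
    """Flag pixels whose absolute reflectance change exceeds threshold."""
    diff = np.abs(img_t1.astype(float) - img_t0.astype(float))
    return diff > threshold

rng = np.random.default_rng(0)
before = rng.uniform(0.0, 0.1, size=(100, 100))  # bare land
after = before.copy()
after[40:60, 40:60] += 0.5                       # a new structure appears

mask = change_mask(before, after)
changed_share = mask.mean()  # fraction of pixels flagged: 400/10000 = 0.04
```

Aggregating `changed_share` over tiles covering a developer's project sites yields a construction-progress time series.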

Agricultural monitoring. Vietnam is a leading exporter of rice, coffee, rubber, and seafood. The Normalized Difference Vegetation Index (NDVI), computed from multispectral satellite data, provides crop health assessments:

\[ \text{NDVI} = \frac{\rho_{\text{NIR}} - \rho_{\text{Red}}}{\rho_{\text{NIR}} + \rho_{\text{Red}}} \tag{45.2}\]

where \(\rho_{\text{NIR}}\) and \(\rho_{\text{Red}}\) are reflectance in the near-infrared and red bands. NDVI ranges from \(-1\) to \(+1\), with values above 0.3 indicating healthy vegetation. Deviations from seasonal norms proxy for crop yield surprises.
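Equation 45.2 applied to raw band arrays, as a sketch with synthetic reflectance values (real inputs would be the Red and NIR bands of a MODIS or Sentinel-2 tile):

```python
import numpy as np

def compute_ndvi(nir, red, eps=1e-9):
    """Per-pixel NDVI from near-infrared and red reflectance arrays."""
    nir = nir.astype(float)
    red = red.astype(float)
    return (nir - red) / (nir + red + eps)  # eps guards against 0/0

# Healthy vegetation reflects strongly in NIR and absorbs red
nir = np.array([[0.50, 0.45], [0.10, 0.08]])
red = np.array([[0.08, 0.10], [0.09, 0.07]])

ndvi = compute_ndvi(nir, red)
# Top row exceeds the 0.3 healthy-vegetation threshold;
# bottom row is near zero (bare soil or water)
```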

# Load NDVI data for Vietnamese agricultural regions
# Source: MODIS/Terra (MOD13Q1, 250m resolution, 16-day composites)
ndvi_data = dc.get_ndvi_data(
    start_date="2014-01-01",
    end_date="2024-12-31",
    regions=["mekong_delta", "central_highlands",
             "red_river_delta", "southeast"]
)

# Compute NDVI anomaly: deviation from 5-year seasonal average
ndvi_data["month"] = ndvi_data["date"].dt.month

# Two 16-day composites per month, so within each (region, month) group
# a 5-year trailing window is roughly 10 observations
seasonal_mean = (
    ndvi_data.groupby(["region", "month"])
    ["mean_ndvi"].transform(
        lambda x: x.rolling(10, min_periods=4).mean()
    )
)
ndvi_data["ndvi_anomaly"] = ndvi_data["mean_ndvi"] - seasonal_mean

# Agricultural sector firms
agri_firms = dc.get_firms_by_sector(sector="agriculture")

# Link NDVI anomaly to agricultural firm returns
agri_returns = monthly_returns[
    monthly_returns["ticker"].isin(agri_firms["ticker"])
].copy()

agri_returns["month"] = agri_returns["date"].dt.month
agri_returns["year"] = agri_returns["date"].dt.year

# Regional NDVI aggregation (Mekong Delta for rice firms, etc.)
mekong_ndvi = ndvi_data[ndvi_data["region"] == "mekong_delta"].copy()
mekong_ndvi["year"] = mekong_ndvi["date"].dt.year
mekong_ndvi["month"] = mekong_ndvi["date"].dt.month

mekong_monthly = (
    mekong_ndvi.groupby(["year", "month"])
    .agg(ndvi_anomaly=("ndvi_anomaly", "mean"))
    .reset_index()
)
ndvi_plot = ndvi_data[ndvi_data["region"] == "mekong_delta"].copy()

(
    p9.ggplot(ndvi_plot, p9.aes(x="date", y="ndvi_anomaly"))
    + p9.geom_line(color="#27AE60", alpha=0.5, size=0.4)
    + p9.geom_smooth(method="lowess", color="#2E5090", size=1, se=False)
    + p9.geom_hline(yintercept=0, linetype="dashed", color="gray")
    + p9.labs(
        x="",
        y="NDVI Anomaly",
        title="Mekong Delta Vegetation Health: Deviation from Seasonal Norm"
    )
    + p9.theme_minimal()
    + p9.theme(figure_size=(12, 5))
)
Figure 45.2: Mekong Delta Vegetation Health (Deviation from Seasonal Norm)
# Panel regression: agricultural firm returns on NDVI anomaly
agri_panel = agri_returns.merge(
    mekong_monthly,
    on=["year", "month"],
    how="inner"
)

# Lagged NDVI anomaly (one month)
agri_panel = agri_panel.sort_values(["ticker", "date"])
agri_panel["ndvi_lag1"] = agri_panel.groupby(
    "ticker"
)["ndvi_anomaly"].shift(1)

agri_clean = agri_panel.dropna(
    subset=["ret", "ndvi_lag1"]
).set_index(["ticker", "date"])

# The NDVI anomaly is common to all Mekong firms in a given month, so
# time fixed effects would absorb it entirely; use entity effects only
model_ndvi = PanelOLS(
    agri_clean["ret"],
    agri_clean[["ndvi_lag1"]],
    entity_effects=True
).fit(cov_type="clustered", cluster_entity=True)

agri_clean = agri_clean.reset_index()

print(f"NDVI → Agricultural Returns:")
print(f"  β(NDVI_lag): {model_ndvi.params['ndvi_lag1']:.4f}")
print(f"  t-stat: {model_ndvi.tstats['ndvi_lag1']:.3f}")
print(f"  R² (within): {model_ndvi.rsquared_within:.4f}")

45.2.4 Satellite Feature Extraction with CNNs

For raw satellite imagery (rather than pre-computed indices like NDVI), we use transfer learning from CNNs to extract spatial features. The approach follows Jean et al. (2016): use a CNN pre-trained on ImageNet to extract feature vectors from satellite tiles, then regress economic outcomes on these features.

def satellite_feature_pipeline(image_dir, model_name="resnet50"):
    """
    Extract CNN features from satellite image tiles.

    Parameters
    ----------
    image_dir : str or Path
        Directory containing satellite tiles (PNG/TIFF).
    model_name : str
        Pre-trained model to use.

    Returns
    -------
    DataFrame : image_id, feature vector columns.
    """
    image_dir = Path(image_dir)
    image_paths = sorted(image_dir.glob("*.png")) + sorted(
        image_dir.glob("*.tif")
    )

    if not image_paths:
        print("No images found.")
        return pd.DataFrame()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, transform, dim = build_feature_extractor(model_name, device)

    features = extract_features(image_paths, model, transform, device)

    # Create DataFrame
    feature_cols = [f"feat_{i}" for i in range(dim)]
    df = pd.DataFrame(features, columns=feature_cols)
    df["image_id"] = [p.stem for p in image_paths]

    return df


def predict_economic_activity(features_df, labels_df, label_col,
                              n_components=50):
    """
    Predict economic activity from satellite image features.

    Uses PCA for dimensionality reduction, then ridge regression.

    Parameters
    ----------
    features_df : DataFrame
        CNN features with image_id.
    labels_df : DataFrame
        Economic outcomes with image_id.
    label_col : str
        Target variable column name.
    n_components : int
        PCA components to retain.

    Returns
    -------
    dict : R², coefficients, cross-validated performance.
    """
    from sklearn.decomposition import PCA
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import cross_val_score

    merged = features_df.merge(labels_df, on="image_id")
    feature_cols = [c for c in features_df.columns if c.startswith("feat_")]

    X = merged[feature_cols].values
    y = merged[label_col].values

    # PCA
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    var_explained = pca.explained_variance_ratio_.sum()

    # Ridge regression with cross-validation
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 20), cv=5)
    cv_scores = cross_val_score(ridge, X_pca, y, cv=5, scoring="r2")

    ridge.fit(X_pca, y)

    return {
        "r2_cv_mean": cv_scores.mean(),
        "r2_cv_std": cv_scores.std(),
        "r2_train": ridge.score(X_pca, y),
        "pca_var_explained": var_explained,
        "optimal_alpha": ridge.alpha_,
        "n_images": len(merged)
    }

45.3 Document Image Analysis

45.3.1 The Vietnamese Filing Problem

A substantial fraction of Vietnamese corporate disclosures (e.g., annual reports, financial statements, board resolutions, shareholder meeting minutes) are distributed as scanned PDF images rather than machine-readable text. This creates a data extraction bottleneck: the information exists but is trapped in pixel format. Unlike filings in more developed markets (where XBRL mandates ensure machine readability), Vietnamese filings require Optical Character Recognition (OCR) and layout analysis before any quantitative analysis can begin.

The document AI pipeline for Vietnamese financial filings involves four stages:

  1. Page classification: Identify which pages contain financial statements, management discussion, audit opinions, etc.
  2. Layout analysis: Detect the spatial structure such as headers, paragraphs, tables, figures, captions.
  3. OCR: Convert image regions to text, using Vietnamese-optimized models.
  4. Structured extraction: Parse the recognized text into structured data (e.g., revenue figures, balance sheet items).
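The four stages compose into a single driver. The sketch below takes each stage as a function parameter so that OCR engines and layout models can be swapped without changing the control flow; the lambda stubs are placeholders that only illustrate the data flow, not real models:

```python
def run_document_pipeline(pages, classify, analyze_layout, ocr, extract):
    """Chain the four document AI stages over a list of page images."""
    records = []
    for page in pages:
        page_type = classify(page)                   # 1. page classification
        regions = analyze_layout(page)               # 2. layout analysis
        texts = [ocr(region) for region in regions]  # 3. OCR
        records.append(extract(page_type, texts))    # 4. structured extraction
    return records

# Stub stages illustrating the data flow
records = run_document_pipeline(
    pages=["page_1"],
    classify=lambda page: "balance_sheet",
    analyze_layout=lambda page: ["header_region", "table_region"],
    ocr=lambda region: region.upper(),
    extract=lambda ptype, texts: {"type": ptype, "n_regions": len(texts)},
)
# records == [{"type": "balance_sheet", "n_regions": 2}]
```

Passing the stages as functions keeps the driver testable; the PaddleOCR and layout components developed in this section slot in directly.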

45.3.2 OCR for Vietnamese Financial Documents

Standard OCR engines (Tesseract, Google Cloud Vision) struggle with Vietnamese financial documents due to the combination of Vietnamese diacritics (ă, ơ, ư, ê, etc.), mixed Vietnamese-English content, and complex table layouts. We implement a pipeline using PaddleOCR (which has strong CJK and Southeast Asian language support) and VietOCR (a Vietnamese-specific model based on the transformer architecture of Baek et al. (2019)).

def ocr_financial_document(pdf_path, language="vi",
                           engine="paddleocr"):
    """
    OCR a Vietnamese financial document (scanned PDF).

    Parameters
    ----------
    pdf_path : str
        Path to PDF file.
    language : str
        Language code.
    engine : str
        'paddleocr' or 'vietocr'.

    Returns
    -------
    list[dict] : OCR line results with text, confidence,
        bounding box, and page number.
    """
    from pdf2image import convert_from_path

    # Convert PDF pages to images
    pages = convert_from_path(pdf_path, dpi=300)

    results = []

    if engine == "paddleocr":
        from paddleocr import PaddleOCR
        ocr = PaddleOCR(use_angle_cls=True, lang=language, use_gpu=False)

        for page_num, page_img in enumerate(pages):
            # Convert PIL to numpy
            img_array = np.array(page_img)
            ocr_result = ocr.ocr(img_array, cls=True)

            if not ocr_result or ocr_result[0] is None:
                continue  # blank page: nothing detected

            page_texts = []
            for line in ocr_result[0]:
                bbox, (text, confidence) = line
                page_texts.append({
                    "text": text,
                    "confidence": confidence,
                    "bbox": bbox,
                    "page": page_num + 1
                })

            results.extend(page_texts)
    else:
        raise NotImplementedError(f"OCR engine '{engine}' not supported")

    return results


def classify_page_type(ocr_results, page_num):
    """
    Classify a document page by content type using keyword matching.

    Returns one of: 'balance_sheet', 'income_statement',
    'cash_flow', 'notes', 'audit', 'management', 'other'.
    """
    page_text = " ".join(
        [r["text"] for r in ocr_results if r["page"] == page_num]
    ).lower()

    # Vietnamese financial statement keywords
    keyword_map = {
        "balance_sheet": [
            "bảng cân đối kế toán", "tài sản", "nguồn vốn",
            "nợ phải trả", "vốn chủ sở hữu"
        ],
        "income_statement": [
            "kết quả hoạt động kinh doanh", "doanh thu",
            "lợi nhuận", "chi phí", "thu nhập"
        ],
        "cash_flow": [
            "lưu chuyển tiền tệ", "dòng tiền",
            "hoạt động kinh doanh", "hoạt động đầu tư"
        ],
        "notes": [
            "thuyết minh báo cáo tài chính", "thuyết minh"
        ],
        "audit": [
            "báo cáo kiểm toán", "kiểm toán viên",
            "ý kiến kiểm toán", "trung thực và hợp lý"
        ],
        "management": [
            "ban giám đốc", "hội đồng quản trị",
            "báo cáo thường niên", "tình hình hoạt động"
        ]
    }

    scores = {}
    for page_type, keywords in keyword_map.items():
        scores[page_type] = sum(
            1 for kw in keywords if kw in page_text
        )

    if max(scores.values()) == 0:
        return "other"
    return max(scores, key=scores.get)

45.3.3 Table Extraction from Financial Statements

The highest-value extraction task is recovering structured tables from financial statements. We implement a two-stage approach: first detect table regions using a layout analysis model, then parse the detected regions into row-column structure.

def extract_tables_from_page(page_image, ocr_results, page_num):
    """
    Extract structured tables from a document page.

    Uses spatial clustering of OCR bounding boxes to identify
    table regions, then aligns text into rows and columns.

    Parameters
    ----------
    page_image : PIL.Image
        Page image.
    ocr_results : list[dict]
        OCR results for this page.
    page_num : int
        Page number.

    Returns
    -------
    list[pd.DataFrame] : Extracted tables as DataFrames.
    """
    page_texts = [r for r in ocr_results if r["page"] == page_num]

    if not page_texts:
        return []

    # Extract bounding box centers
    centers = []
    for item in page_texts:
        bbox = item["bbox"]
        # bbox is [[x1,y1],[x2,y2],[x3,y3],[x4,y4]]
        cx = np.mean([p[0] for p in bbox])
        cy = np.mean([p[1] for p in bbox])
        centers.append((cx, cy, item["text"]))

    if not centers:
        return []

    centers_df = pd.DataFrame(centers, columns=["x", "y", "text"])

    # Cluster into rows by y-coordinate proximity
    centers_df = centers_df.sort_values("y")
    row_threshold = 15  # pixels
    centers_df["row_id"] = (
        centers_df["y"].diff().abs() > row_threshold
    ).cumsum()

    # Within each row, sort by x-coordinate
    tables = []
    rows = []
    for row_id, row_group in centers_df.groupby("row_id"):
        row_sorted = row_group.sort_values("x")
        rows.append(row_sorted["text"].tolist())

    if len(rows) > 2:
        # Attempt to construct DataFrame
        max_cols = max(len(r) for r in rows)
        # Pad shorter rows
        padded = [r + [""] * (max_cols - len(r)) for r in rows]

        try:
            df = pd.DataFrame(padded[1:], columns=padded[0])
            tables.append(df)
        except Exception:
            tables.append(pd.DataFrame(padded))

    return tables


def parse_financial_numbers(text):
    """
    Parse Vietnamese financial number formats.
    Vietnamese uses dots as thousands separators and commas as decimals.
    E.g., '1.234.567' = 1234567, '1.234,56' = 1234.56
    """
    text = text.strip().replace(" ", "")

    # Remove parentheses (negative indicator)
    negative = text.startswith("(") and text.endswith(")")
    text = text.strip("()")

    # Handle Vietnamese number format
    # If comma is present, it's a decimal separator
    if "," in text:
        text = text.replace(".", "").replace(",", ".")
    else:
        text = text.replace(".", "")

    try:
        value = float(text)
        return -value if negative else value
    except ValueError:
        return np.nan

45.3.4 Layout-Aware Document Understanding

Modern document AI goes beyond OCR by jointly modeling text content and spatial layout. The LayoutLM family of models, most recently LayoutLMv3 (Huang et al. 2022), treats each token as having both a text embedding and a positional embedding derived from its bounding box coordinates. This allows the model to understand that a number positioned below a “Revenue” header and to the right of “2023” is the 2023 revenue figure, even without explicit table detection.

def layoutlm_extract(document_pages, model_name="layoutlmv3"):
    """
    Extract structured financial data using LayoutLM.

    This function uses the pre-trained LayoutLMv3 model for
    document understanding with Vietnamese financial statements.

    Parameters
    ----------
    document_pages : list
        List of (page_image, ocr_results) tuples.
    model_name : str
        Model variant.

    Returns
    -------
    dict : Extracted financial fields.
    """
    from transformers import (
        LayoutLMv3ForTokenClassification,
        LayoutLMv3Processor
    )

    processor = LayoutLMv3Processor.from_pretrained(
        "microsoft/layoutlmv3-base",
        apply_ocr=False  # We provide our own OCR
    )

    # Note: this loads the base checkpoint with a randomly initialized
    # token-classification head; fine-tune on labeled filings before use
    model = LayoutLMv3ForTokenClassification.from_pretrained(
        "microsoft/layoutlmv3-base",
        num_labels=13  # Financial statement field types
    )

    # Define target fields for extraction
    field_labels = [
        "O",  # Other
        "B-REVENUE", "I-REVENUE",
        "B-COGS", "I-COGS",
        "B-NET_INCOME", "I-NET_INCOME",
        "B-TOTAL_ASSETS", "I-TOTAL_ASSETS",
        "B-TOTAL_EQUITY", "I-TOTAL_EQUITY",
        "B-TOTAL_DEBT", "I-TOTAL_DEBT"
    ]

    extracted = {}

    for page_img, ocr_results in document_pages:
        words = [r["text"] for r in ocr_results]
        width, height = page_img.size
        boxes = []
        for r in ocr_results:
            bbox = r["bbox"]
            # Normalize pixel coordinates to the 0-1000 range LayoutLM expects
            x0 = 1000 * min(p[0] for p in bbox) / width
            y0 = 1000 * min(p[1] for p in bbox) / height
            x1 = 1000 * max(p[0] for p in bbox) / width
            y1 = 1000 * max(p[1] for p in bbox) / height
            boxes.append([int(x0), int(y0), int(x1), int(y1)])

        if not words:
            continue

        # Process through LayoutLM
        encoding = processor(
            page_img,
            words,
            boxes=boxes,
            return_tensors="pt",
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = model(**encoding)

        predictions = outputs.logits.argmax(-1).squeeze().tolist()

        # Extract labeled entities, mapping token predictions back to
        # words via word_ids(): subword tokenization means the token
        # index does not equal the word index
        word_ids = encoding.word_ids()
        for idx, pred in enumerate(predictions):
            word_id = word_ids[idx]
            if pred > 0 and word_id is not None:
                label = field_labels[pred]
                if label.startswith("B-"):
                    field = label[2:]
                    value = parse_financial_numbers(words[word_id])
                    if not np.isnan(value):
                        extracted[field] = value

    return extracted

45.4 Chart and Figure Digitization

45.4.1 Motivation: Unlocking Visual Financial Data

Financial charts (e.g., price time series, bar charts of earnings, scatter plots of risk-return tradeoffs) embed information that analysts process visually. For systematic strategies, this information must be converted to numerical form. Three use cases motivate chart digitization:

  1. Historical data recovery. Pre-digital financial data often exists only in printed charts. Digitizing these charts extends historical time series beyond the electronic era.
  2. Broker report extraction. Sell-side research reports contain charts with projections and scenario analyses. Extracting these programmatically enables systematic aggregation of analyst views.
  3. Regulatory filings. Vietnamese regulatory filings sometimes embed data as images (charts, scanned tables) rather than as machine-readable values.

45.4.2 Chart Type Classification

The first step is classifying the chart type (line, bar, scatter, pie, candlestick), which determines the appropriate digitization algorithm.

def build_chart_classifier(n_classes=5):
    """
    Build a CNN-based chart type classifier.

    Classes: line_chart, bar_chart, scatter_plot,
             candlestick, pie_chart.
    """
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Replace final layer for chart classification
    model.fc = nn.Sequential(
        nn.Dropout(0.3),
        nn.Linear(512, n_classes)
    )

    return model


def classify_chart(image_path, model, transform):
    """Classify a chart image into one of 5 types."""
    class_names = [
        "line_chart", "bar_chart", "scatter_plot",
        "candlestick", "pie_chart"
    ]

    img = Image.open(image_path).convert("RGB")
    tensor = transform(img).unsqueeze(0)

    with torch.no_grad():
        logits = model(tensor)
        probs = torch.softmax(logits, dim=1).squeeze()

    pred_idx = probs.argmax().item()
    return {
        "predicted_class": class_names[pred_idx],
        "confidence": probs[pred_idx].item(),
        "all_probs": {
            name: probs[i].item()
            for i, name in enumerate(class_names)
        }
    }

45.4.3 Line Chart Digitization

For line charts, the digitization task is to recover the \((x, y)\) data series from the image. The pipeline involves axis detection, scale calibration, and curve tracing.

def digitize_line_chart(image_path, x_range=None, y_range=None):
    """
    Digitize a line chart image to recover the data series.

    Parameters
    ----------
    image_path : str
        Path to chart image.
    x_range : tuple, optional
        (x_min, x_max) if known.
    y_range : tuple, optional
        (y_min, y_max) if known.

    Returns
    -------
    DataFrame : Digitized data points (x, y).
    """
    import cv2

    img = cv2.imread(str(image_path))
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape

    # Step 1: Detect plot area (largest rectangular region)
    edges = cv2.Canny(gray, 50, 150)
    contours, _ = cv2.findContours(
        edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )

    if contours:
        largest = max(contours, key=cv2.contourArea)
        x_start, y_start, plot_w, plot_h = cv2.boundingRect(largest)
    else:
        # Fallback: assume plot is central 80% of image
        x_start, y_start = int(w * 0.1), int(h * 0.1)
        plot_w, plot_h = int(w * 0.8), int(h * 0.8)

    # Step 2: Extract line pixels within plot area
    # Convert to HSV and isolate colored lines
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    plot_region = hsv[y_start:y_start + plot_h,
                      x_start:x_start + plot_w]

    # Detect non-white, non-gray pixels (likely the line)
    saturation = plot_region[:, :, 1]
    line_mask = saturation > 30  # Colored pixels

    # Step 3: Trace the line (column-wise median of colored pixels)
    data_points = []
    for col in range(plot_w):
        col_pixels = np.where(line_mask[:, col])[0]
        if len(col_pixels) > 0:
            # Use median y-position
            y_pixel = np.median(col_pixels)

            # Convert pixel to data coordinates
            x_frac = col / plot_w
            y_frac = 1 - y_pixel / plot_h  # Invert y-axis

            x_val = (x_range[0] + x_frac * (x_range[1] - x_range[0])
                     if x_range else x_frac)
            y_val = (y_range[0] + y_frac * (y_range[1] - y_range[0])
                     if y_range else y_frac)

            data_points.append({"x": x_val, "y": y_val})

    return pd.DataFrame(data_points)

45.5 Visual Sentiment Analysis

45.5.1 Image Sentiment in Financial News

News articles are accompanied by images that carry sentiment independent of the text. A photograph of a CEO smiling at a press conference conveys different information than the same CEO facing protesters. Obaid and Pukthuanthong (2022) demonstrate that the visual sentiment of Wall Street Journal photographs predicts market returns: days with more negative imagery precede lower returns.

We implement visual sentiment analysis using two approaches: a pre-trained sentiment classifier and a vision-language model that interprets images in financial context.

45.5.2 CNN-Based Visual Sentiment

def compute_visual_sentiment(image_paths, model_name="resnet50"):
    """
    Compute visual sentiment scores using a fine-tuned CNN.

    Uses features from a pre-trained CNN followed by a sentiment
    classifier trained on the Visual Sentiment Ontology (VSO)
    or similar dataset.

    Parameters
    ----------
    image_paths : list
        Paths to news images.

    Returns
    -------
    DataFrame : image_path, positive_score, negative_score, net_sentiment.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, transform, dim = build_feature_extractor(model_name, device)

    # Extract features
    features = extract_features(image_paths, model, transform, device)

    # Placeholder sentiment head: the fixed channel split below is
    # purely illustrative and carries no validated sentiment meaning.
    # In practice, train a linear probe on labeled affect data
    # (e.g., the Visual Sentiment Ontology) and use its weights here.
    pos_channels = list(range(0, dim // 3))
    neg_channels = list(range(dim // 3, 2 * dim // 3))

    pos_scores = features[:, pos_channels].mean(axis=1)
    neg_scores = features[:, neg_channels].mean(axis=1)

    # Normalize to [0, 1]
    pos_norm = (pos_scores - pos_scores.min()) / (
        pos_scores.max() - pos_scores.min() + 1e-8
    )
    neg_norm = (neg_scores - neg_scores.min()) / (
        neg_scores.max() - neg_scores.min() + 1e-8
    )

    sentiment = pos_norm - neg_norm

    return pd.DataFrame({
        "image_path": image_paths,
        "positive_score": pos_norm,
        "negative_score": neg_norm,
        "net_sentiment": sentiment
    })

45.5.3 Vision-Language Models for Financial Image Understanding

The most powerful approach to financial image analysis uses vision-language models (VLMs), which jointly process images and text. Models such as CLIP (Radford et al. 2021), BLIP-2 (Li et al. 2023), and GPT-4V can be prompted to interpret financial images in context. For instance, given an aerial photograph of a factory, a VLM can answer “Is this factory operating at full capacity?” or “Is there visible construction of additional facilities?”

def vlm_financial_analysis(image_path, prompt, model_name="clip"):
    """
    Use a vision-language model to analyze a financial image.

    Parameters
    ----------
    image_path : str
        Path to image.
    prompt : str
        Financial analysis prompt.
    model_name : str
        'clip' for zero-shot classification,
        'blip2' for visual question answering.

    Returns
    -------
    dict : Model output (scores or text).
    """
    img = Image.open(image_path).convert("RGB")

    if model_name == "clip":
        from transformers import CLIPProcessor, CLIPModel

        clip_model = CLIPModel.from_pretrained(
            "openai/clip-vit-base-patch32"
        )
        processor = CLIPProcessor.from_pretrained(
            "openai/clip-vit-base-patch32"
        )

        # Zero-shot classification with financial labels
        labels = [
            "busy commercial area with many customers",
            "empty commercial area with few customers",
            "active construction site with workers",
            "idle construction site without activity",
            "healthy green crops in agricultural field",
            "damaged or dry crops in agricultural field",
            "busy port with many ships and containers",
            "quiet port with few ships"
        ]

        inputs = processor(
            text=labels,
            images=img,
            return_tensors="pt",
            padding=True
        )

        with torch.no_grad():
            outputs = clip_model(**inputs)
            logits = outputs.logits_per_image.squeeze()
            probs = torch.softmax(logits, dim=0)

        results = {
            label: prob.item()
            for label, prob in zip(labels, probs)
        }

        return {"scores": results, "top_label": max(results, key=results.get)}

    elif model_name == "blip2":
        from transformers import Blip2Processor, Blip2ForConditionalGeneration

        device = "cuda" if torch.cuda.is_available() else "cpu"
        # float16 only on GPU; CPU inference needs float32
        dtype = torch.float16 if device == "cuda" else torch.float32

        processor = Blip2Processor.from_pretrained(
            "Salesforce/blip2-opt-2.7b"
        )
        model = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b",
            torch_dtype=dtype
        ).to(device)

        inputs = processor(images=img, text=prompt, return_tensors="pt")
        inputs = {k: v.to(device) for k, v in inputs.items()}
        inputs["pixel_values"] = inputs["pixel_values"].to(dtype)

        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=100)
            answer = processor.decode(
                generated_ids[0], skip_special_tokens=True
            )

        return {"answer": answer}

    raise ValueError(f"Unknown model_name: {model_name}")

We now apply these tools to construct a daily visual sentiment index from Vietnamese news images and test whether it predicts market returns.

# Construct daily visual sentiment index from news images
# Source: Vietnamese financial news sites (VnExpress, CafeF, etc.)
news_images = dc.get_news_images(
    start_date="2018-01-01",
    end_date="2024-12-31",
    source=["vnexpress_finance", "cafef"]
)

# Aggregate daily visual sentiment
daily_sentiment = (
    news_images.groupby("date")
    .agg(
        visual_sentiment=("net_sentiment", "mean"),
        n_images=("net_sentiment", "count"),
        pct_negative=("net_sentiment", lambda x: (x < 0).mean())
    )
    .reset_index()
)

# Merge with market returns
market_returns = dc.get_market_returns(
    start_date="2018-01-01",
    end_date="2024-12-31",
    frequency="daily"
)

sentiment_returns = daily_sentiment.merge(
    market_returns[["date", "mkt_ret"]],
    on="date",
    how="inner"
)

# Lead-lag analysis: does visual sentiment predict next-day returns?
sentiment_returns = sentiment_returns.sort_values("date")
sentiment_returns["mkt_ret_lead1"] = sentiment_returns["mkt_ret"].shift(-1)

Table 45.3: Visual Sentiment and Market Return Predictability

# Regression: next-day return on today's visual sentiment
sr_clean = sentiment_returns.dropna(
    subset=["mkt_ret_lead1", "visual_sentiment", "mkt_ret"]
)

model_sent = sm.OLS(
    sr_clean["mkt_ret_lead1"],
    sm.add_constant(sr_clean[["visual_sentiment", "mkt_ret"]])
).fit(cov_type="HAC", cov_kwds={"maxlags": 5})

sent_results = pd.DataFrame({
    "Coefficient": model_sent.params.round(6),
    "Std Error": model_sent.bse.round(6),
    "t-stat": model_sent.tvalues.round(3),
    "p-value": model_sent.pvalues.round(4)
})
sent_results

45.6 Multimodal Fusion: Combining Image and Text

45.6.1 Why Multimodal?

Text and images capture different dimensions of the same underlying economic reality. An earnings report describes financial performance in words and numbers; the accompanying photographs show factories, products, and management. A news article about a port describes trade volumes in text; the satellite image shows actual ship positions. Combining both modalities yields a richer representation than either alone.

The fusion architecture depends on the application:

Early fusion. Concatenate image features \(\mathbf{z}^{\text{img}}\) and text features \(\mathbf{z}^{\text{txt}}\) into a single vector \([\mathbf{z}^{\text{img}}; \mathbf{z}^{\text{txt}}]\) before prediction. Simple but ignores cross-modal interactions.

Late fusion. Train separate models on each modality and combine predictions: \(\hat{y} = \alpha \hat{y}^{\text{img}} + (1-\alpha) \hat{y}^{\text{txt}}\). Robust but cannot learn cross-modal features.

Cross-attention fusion. Use transformer cross-attention to let each modality attend to the other. Most powerful but requires more data and computation.

\[ \mathbf{z}^{\text{fused}} = \text{CrossAttention}(\mathbf{z}^{\text{img}}, \mathbf{z}^{\text{txt}}) = \text{softmax}\left(\frac{\mathbf{Q}^{\text{img}} (\mathbf{K}^{\text{txt}})^\top}{\sqrt{d}}\right) \mathbf{V}^{\text{txt}} \tag{45.3}\]

class MultimodalFusionModel(nn.Module):
    """
    Multimodal fusion model combining image and text features
    for financial prediction.

    Supports early fusion, late fusion, and cross-attention.
    """

    def __init__(self, img_dim=2048, txt_dim=768, hidden_dim=256,
                 fusion="early", n_heads=4):
        super().__init__()
        self.fusion = fusion

        # Image projection
        self.img_proj = nn.Sequential(
            nn.Linear(img_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2)
        )

        # Text projection
        self.txt_proj = nn.Sequential(
            nn.Linear(txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2)
        )

        if fusion == "early":
            self.head = nn.Sequential(
                nn.Linear(hidden_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Dropout(0.2),
                nn.Linear(hidden_dim, 1)
            )
        elif fusion == "late":
            self.img_head = nn.Linear(hidden_dim, 1)
            self.txt_head = nn.Linear(hidden_dim, 1)
            self.alpha = nn.Parameter(torch.tensor(0.5))
        elif fusion == "cross_attention":
            self.cross_attn = nn.MultiheadAttention(
                embed_dim=hidden_dim,
                num_heads=n_heads,
                batch_first=True
            )
            self.head = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim // 2),
                nn.ReLU(),
                nn.Linear(hidden_dim // 2, 1)
            )

    def forward(self, img_features, txt_features):
        img_h = self.img_proj(img_features)
        txt_h = self.txt_proj(txt_features)

        if self.fusion == "early":
            combined = torch.cat([img_h, txt_h], dim=-1)
            return self.head(combined).squeeze(-1)

        elif self.fusion == "late":
            img_pred = self.img_head(img_h).squeeze(-1)
            txt_pred = self.txt_head(txt_h).squeeze(-1)
            alpha = torch.sigmoid(self.alpha)
            return alpha * img_pred + (1 - alpha) * txt_pred

        elif self.fusion == "cross_attention":
            # Image attends to text. NOTE: with a single pooled vector
            # per modality, the softmax over one key is trivially 1 and
            # attention degenerates to a value projection; meaningful
            # cross-attention requires token-level (sequence) features
            img_h_unsq = img_h.unsqueeze(1)  # (B, 1, D)
            txt_h_unsq = txt_h.unsqueeze(1)

            attn_out, _ = self.cross_attn(
                img_h_unsq, txt_h_unsq, txt_h_unsq
            )
            return self.head(attn_out.squeeze(1)).squeeze(-1)
def run_multimodal_experiment(image_features, text_features, returns,
                              fusion_types=("early", "late",
                                            "cross_attention")):
    """
    Compare multimodal fusion strategies for return prediction.

    Parameters
    ----------
    image_features : np.ndarray
        Image feature matrix (N x img_dim).
    text_features : np.ndarray
        Text feature matrix (N x txt_dim).
    returns : np.ndarray
        Target returns (N,).
    fusion_types : iterable of str
        Fusion strategies to compare.

    Returns
    -------
    DataFrame : mean and std of out-of-sample R² per strategy.
    """
    from sklearn.model_selection import TimeSeriesSplit

    tscv = TimeSeriesSplit(n_splits=5)

    results = []

    for fusion in fusion_types:
        fold_r2s = []

        for train_idx, test_idx in tscv.split(returns):
            # Convert to tensors
            X_img_train = torch.tensor(
                image_features[train_idx], dtype=torch.float32
            )
            X_txt_train = torch.tensor(
                text_features[train_idx], dtype=torch.float32
            )
            y_train = torch.tensor(
                returns[train_idx], dtype=torch.float32
            )

            X_img_test = torch.tensor(
                image_features[test_idx], dtype=torch.float32
            )
            X_txt_test = torch.tensor(
                text_features[test_idx], dtype=torch.float32
            )
            y_test = returns[test_idx]

            # Build and train model
            model = MultimodalFusionModel(
                img_dim=image_features.shape[1],
                txt_dim=text_features.shape[1],
                fusion=fusion
            )

            optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
            loss_fn = nn.MSELoss()

            model.train()
            for epoch in range(50):
                optimizer.zero_grad()
                pred = model(X_img_train, X_txt_train)
                loss = loss_fn(pred, y_train)
                loss.backward()
                optimizer.step()

            # Evaluate
            model.eval()
            with torch.no_grad():
                y_pred = model(X_img_test, X_txt_test).numpy()

            ss_res = np.sum((y_test - y_pred) ** 2)
            ss_tot = np.sum((y_test - y_test.mean()) ** 2)
            r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0

            fold_r2s.append(r2)

        results.append({
            "fusion": fusion,
            "r2_mean": np.mean(fold_r2s),
            "r2_std": np.std(fold_r2s)
        })

    # Add unimodal baselines
    for modality, features in [("image_only", image_features),
                                ("text_only", text_features)]:
        from sklearn.linear_model import RidgeCV
        fold_r2s = []
        for train_idx, test_idx in tscv.split(returns):
            ridge = RidgeCV(alphas=np.logspace(-3, 3, 10))
            ridge.fit(features[train_idx], returns[train_idx])
            y_pred = ridge.predict(features[test_idx])
            y_test = returns[test_idx]
            ss_res = np.sum((y_test - y_pred) ** 2)
            ss_tot = np.sum((y_test - y_test.mean()) ** 2)
            fold_r2s.append(1 - ss_res / ss_tot if ss_tot > 0 else 0)

        results.append({
            "fusion": modality,
            "r2_mean": np.mean(fold_r2s),
            "r2_std": np.std(fold_r2s)
        })

    return pd.DataFrame(results)

45.6.2 Practical Considerations for Vietnamese Markets

Multimodal analysis in Vietnamese markets faces several practical considerations:

Data alignment. Satellite images, news articles, and market data operate on different temporal frequencies and spatial resolutions. Satellite composites are available weekly or biweekly; news is daily; trading is intraday. Proper alignment requires specifying the information set available to an investor at the time of the trading decision to avoid look-ahead bias.
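As a concrete illustration of as-of alignment, the sketch below (with hypothetical column names) merges weekly satellite composites onto daily trading dates so that each trading day sees only the most recent composite already published, never a future one:

```python
import pandas as pd

# Hypothetical weekly satellite composites, stamped on publication date
satellite = pd.DataFrame({
    "publish_date": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-19"]),
    "luminosity_index": [0.82, 0.91, 0.88],
})

# Daily trading dates (business days)
trading = pd.DataFrame({"date": pd.bdate_range("2024-01-08", "2024-01-22")})

# direction="backward" attaches, to each trading day, the latest
# composite published on or before that day -- no look-ahead bias
aligned = pd.merge_asof(
    trading, satellite,
    left_on="date", right_on="publish_date",
    direction="backward",
)
```

The same pattern extends to any lagged alternative-data series; the key discipline is stamping each observation with its publication date, not the date of the underlying economic activity.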

Label scarcity. Supervised learning requires labeled data (e.g., images annotated with economic outcomes). In Vietnam, ground-truth labels (actual retail sales, actual crop yields, actual port throughput) arrive with significant lags and often lack the granularity to match satellite resolution. Semi-supervised and self-supervised approaches are therefore essential.
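One simple semi-supervised pattern is confidence-thresholded pseudo-labeling: fit a classifier on the scarce labels, label the unlabeled pool only where the model is confident, and refit on the augmented set. The sketch below uses synthetic feature vectors as stand-ins for real satellite features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 40 labeled and 400 unlabeled feature vectors
X_lab = rng.normal(size=(40, 16))
y_lab = (X_lab[:, 0] > 0).astype(int)
X_unlab = rng.normal(size=(400, 16))

# Step 1: fit on the scarce labels
clf = LogisticRegression().fit(X_lab, y_lab)

# Step 2: pseudo-label only the points the model is sure about
probs = clf.predict_proba(X_unlab)
confident = probs.max(axis=1) > 0.9
X_aug = np.vstack([X_lab, X_unlab[confident]])
y_aug = np.concatenate([y_lab, probs[confident].argmax(axis=1)])

# Step 3: refit on the augmented training set
clf_semi = LogisticRegression().fit(X_aug, y_aug)
```

The confidence threshold trades label noise against sample size; in satellite applications, it should be tuned against whatever delayed ground truth eventually arrives.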

Regulatory considerations. High-resolution satellite imagery of specific commercial or military installations may be restricted. Researchers should verify that their imagery sources comply with Vietnamese regulations on geospatial data.

Computational cost. Processing satellite tiles through CNNs is computationally intensive. A single Sentinel-2 tile at 10m resolution covering Ho Chi Minh City contains approximately \(10{,}980 \times 10{,}980\) pixels per band. Tiling into \(224 \times 224\) patches for CNN input generates \(\sim 2{,}400\) patches per tile, each requiring a forward pass through the network.
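The patch count quoted above follows directly from the tile and patch dimensions; a quick back-of-the-envelope calculation:

```python
import math

tile_px = 10_980   # Sentinel-2 tile width/height in pixels at 10 m
patch_px = 224     # standard CNN input size

# Non-overlapping tiling, padding the ragged edge
patches_per_side = math.ceil(tile_px / patch_px)  # 50
n_patches = patches_per_side ** 2                 # 2,500

# Without edge padding (floor), slightly fewer patches
n_patches_floor = (tile_px // patch_px) ** 2      # 49 ** 2 = 2,401
```

The floor count (~2,400) matches the figure in the text; each patch requires one forward pass, so batching and GPU throughput dominate the cost of processing a full tile.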

Table 45.4: Image Data Sources for Vietnamese Financial Applications

| Application             | Image Source       | Resolution | Frequency    | Vietnamese Availability |
|-------------------------|--------------------|------------|--------------|-------------------------|
| Nighttime luminosity    | VIIRS/DMSP         | 500m       | Monthly      | Free (NOAA/EOG)         |
| Crop health             | MODIS/Sentinel-2   | 250m/10m   | 16-day/5-day | Free (NASA/ESA)         |
| Port/ship detection     | Sentinel-1 (SAR)   | 10m        | 12-day       | Free (ESA Copernicus)   |
| Construction monitoring | Commercial (Maxar) | 30cm       | On demand    | Paid ($)                |
| Urban density           | Sentinel-2         | 10m        | 5-day        | Free (ESA)              |
| Document OCR            | Corporate filings  | N/A        | Event-driven | DataCore.vn             |
| News images             | Financial media    | N/A        | Daily        | Web scraping            |

45.7 Summary

This chapter extended the alternative data toolkit from text (the previous chapter) to images. We demonstrated five distinct application domains for visual data in Vietnamese financial markets.

First, satellite and geospatial imagery provides high-frequency, spatially granular economic signals that lead official statistics. Nighttime luminosity serves as a provincial GDP proxy with cross-sectional \(R^2\) exceeding 0.7; NDVI crop health indices predict agricultural firm returns; and CNN features extracted from satellite tiles enable rich spatial representations of economic activity.

Second, document image analysis solves the practical problem of extracting structured data from Vietnamese financial filings that arrive as scanned images. The pipeline (e.g., OCR with Vietnamese-optimized engines, layout analysis, table extraction, and LayoutLM-based document understanding) converts unstructured pixels into the structured financial data that all downstream analyses require.

Third, chart digitization recovers numerical data series from visual representations, extending historical coverage and enabling systematic consumption of analyst outputs. Fourth, visual sentiment analysis from news imagery provides a signal dimension orthogonal to textual sentiment, with potential predictive power for market returns.

Fifth, multimodal fusion (combining image and text representations via early, late, or cross-attention architectures) yields richer predictive models than either modality alone. The practical benefit of multimodal approaches scales with the diversity and quality of available data, making it increasingly relevant as Vietnamese alternative data ecosystems mature.

The common thread across all applications is the transformation pipeline: raw pixel tensor \(\to\) feature representation (via CNN, ViT, or VLM) \(\to\) financial signal \(\to\) economic interpretation. The choice of architecture and the quality of the domain adaptation determine whether the resulting signal has genuine predictive content or merely captures noise.