44  Textual Analysis

Textual analysis has emerged as one of the most productive research frontiers in empirical finance over the past two decades. The insight that unstructured text, such as corporate filings, earnings calls, analyst reports, and news articles, contains economically meaningful information beyond what is captured in structured numerical data has reshaped how researchers and practitioners understand financial markets. This chapter introduces the full pipeline of textual analysis methods as applied to Vietnamese listed firms, progressing from classical bag-of-words approaches through modern transformer-based language models.

The Vietnamese equity market presents unique opportunities and challenges for textual analysis. As of 2024, the Ho Chi Minh Stock Exchange (HOSE), the Hanoi Stock Exchange (HNX), and the UPCoM market together list over 1,600 firms with a combined market capitalization exceeding VND 6,800 trillion (approximately USD 270 billion). Corporate disclosures are filed in Vietnamese, a tonal language with compound-word morphology that demands specialized natural language processing (NLP) tools.

We build on the seminal contributions of Loughran and McDonald (2011) in domain-specific sentiment lexicons, Hoberg and Phillips (2016) in text-based industry classification, and the modern deep learning revolution initiated by Devlin et al. (2019). This chapter covers the following topics:

  1. Constructing the universe of HOSE/HNX listed firms and retrieving their business descriptions and annual report text.
  2. Vietnamese-specific text preprocessing, including word segmentation using VnCoreNLP and underthesea.
  3. Classical document representation via bag-of-words, TF-IDF, and LDA topic models.
  4. Financial sentiment analysis using both dictionary-based and machine learning approaches adapted for Vietnamese.
  5. Text-based firm similarity and peer identification using cosine similarity.
  6. Modern deep learning approaches including Word2Vec, Doc2Vec, PhoBERT embeddings, and sentence transformers.
  7. Large language model (LLM) applications, including zero-shot classification, named entity recognition, and information extraction using Vietnamese-capable models.
  8. Empirical applications linking textual measures to stock returns, volatility, and corporate events.

44.1 Why Textual Analysis for Vietnamese Finance?

The Vietnamese financial market has several characteristics that make textual analysis particularly valuable. First, analyst coverage is sparse (fewer than 30% of listed firms receive regular coverage from sell-side analysts), making alternative information sources critical. Second, the regulatory environment is evolving rapidly, with the State Securities Commission (SSC) continuously updating disclosure requirements, creating rich variation in information environments across firms and time. Third, the market is dominated by retail investors (accounting for roughly 80% of trading volume), who may process textual information differently than institutional investors, creating potential mispricings that text-based strategies could exploit.

From a methodological standpoint, Vietnamese poses interesting NLP challenges. Unlike English, Vietnamese is an isolating language where word boundaries are not always delimited by spaces. A single Vietnamese “word” may consist of multiple syllables separated by spaces (e.g., “công ty” for “company,” “thị trường” for “market”). This requires a word segmentation step before standard NLP pipelines can be applied.1
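A minimal pure-Python sketch makes the pitfall concrete (the phrase "thị trường chứng khoán," "stock market," is an illustrative example): naive whitespace tokenization yields syllables, while the underscore-joined segmented form (the convention used by VnCoreNLP and underthesea) yields words.

```python
# Hypothetical example: "thị trường chứng khoán" ("stock market")
raw = "thị trường chứng khoán"
print(raw.split())        # 4 syllables: naive split over-fragments

# After word segmentation (underscore convention used by
# VnCoreNLP and underthesea), the same text has only 2 words
segmented = "thị_trường chứng_khoán"
print(segmented.split())  # ['thị_trường', 'chứng_khoán']
```

Any downstream vocabulary built on the raw split would treat "trường" and "chứng" as standalone tokens, even though neither is a meaningful word on its own.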

45 Literature Review

45.1 Textual Analysis in Finance

The application of textual analysis to financial data has a rich history. Tetlock (2007) demonstrated that the pessimism content of a Wall Street Journal column predicts aggregate market activity, providing early evidence that textual content moves prices. Loughran and McDonald (2011) showed that the widely-used Harvard General Inquirer sentiment dictionary produces misleading results when applied to financial text because words like “liability,” “tax,” and “capital” are classified as negative in general English but carry neutral or even positive connotations in finance. Their domain-specific word lists have become the standard for financial sentiment analysis.2

Hoberg and Phillips (2010) and Hoberg and Phillips (2016) pioneered the use of product descriptions from 10-K filings to construct text-based industry classifications (TNIC), demonstrating that these dynamic, firm-specific industry definitions outperform static SIC and NAICS codes in explaining firm behavior, including profitability, stock returns, and M&A activity. Subsequent work by Hoberg and Phillips (2018) extended this to assess competitive threats and product-market fluidity.

More recent work has leveraged advances in deep learning. Huang, Wang, and Yang (2023) apply BERT-based models to earnings call transcripts and show that contextual embeddings capture information about future earnings that traditional bag-of-words measures miss. Jha et al. (2024) use GPT-based models for zero-shot financial text classification and demonstrate that LLMs can match or exceed purpose-built classifiers on standard benchmarks.

45.2 NLP for Vietnamese Language

Vietnamese NLP has advanced significantly with the development of VnCoreNLP (Vu et al. 2018), a Java-based toolkit providing word segmentation, POS tagging, named entity recognition, and dependency parsing. The underthesea library offers a Python-native alternative. Most critically for financial applications, PhoBERT (Nguyen and Nguyen 2020) provides Vietnamese-specific BERT pre-training on a 20GB corpus, achieving state-of-the-art results on multiple Vietnamese NLP tasks.

Table 45.1: Key Literature on Textual Analysis in Finance
Study Method Key Finding Relevance to Vietnam
Tetlock (2007) Dictionary-based sentiment from WSJ column Media pessimism predicts market activity and returns Baseline for Vietnamese financial news sentiment
Loughran and McDonald (2011) Domain-specific financial dictionaries General dictionaries misclassify 73% of negative financial words Need for Vietnamese financial sentiment lexicon
Hoberg and Phillips (2016) Cosine similarity on 10-K product descriptions Text-based industries outperform SIC/NAICS Peer identification for Vietnamese firms using business descriptions
Nguyen and Nguyen (2020) PhoBERT: Vietnamese BERT pre-training SOTA on Vietnamese NLP benchmarks Foundation model for Vietnamese financial NLP
Huang, Wang, and Yang (2023) BERT embeddings on earnings calls Contextual embeddings predict future earnings beyond BoW Apply to Vietnamese earnings call transcripts
Jha et al. (2024) GPT-based zero-shot financial classification LLMs match fine-tuned classifiers Zero-shot Vietnamese financial text classification via multilingual LLMs

46 Data: Vietnamese Listed Firms from DataCore.vn

46.1 Constructing the Universe

We construct the universe of Vietnamese listed firms: all firms listed on HOSE, HNX, and UPCoM as of the analysis date.

import pandas as pd
import numpy as np
import re
import unicodedata
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
from typing import List, Dict, Tuple, Optional

warnings.filterwarnings('ignore')
np.random.seed(42)

# Plotting configuration
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11
sns.set_style("whitegrid")
from datacore import DataCoreAPI  # DataCore.vn Python client

# Initialize connection
dc = DataCoreAPI(api_key='YOUR_API_KEY')

# Retrieve universe of all listed firms
universe = dc.get_listed_firms(
    exchanges=['HOSE', 'HNX', 'UPCOM'],
    as_of='2024-12-31',
    fields=[
        'ticker', 'company_name', 'company_name_en',
        'exchange', 'listing_date', 'delisting_date',
        'icb_industry', 'icb_sector', 'icb_subsector',
        'market_cap', 'total_assets', 'revenue'
    ]
)

print(f'Total listed firms: {len(universe)}')
print(f'HOSE: {len(universe[universe.exchange=="HOSE"])}')
print(f'HNX: {len(universe[universe.exchange=="HNX"])}')
print(f'UPCoM: {len(universe[universe.exchange=="UPCOM"])}')
Table 46.1: Universe of Vietnamese Listed Firms by Exchange (as of December 2024)
Exchange N Firms Avg Mkt Cap (VND bn) Median Mkt Cap (VND bn) Total Mkt Cap (VND tn)
HOSE 403 12,847 3,215 5,177
HNX 334 2,156 687 720
UPCoM 868 1,043 298 905
Total 1,605 4,239 712 6,802

46.2 Retrieving Business Descriptions

Business descriptions are available in both Vietnamese and English, and we retrieve both versions. The Vietnamese text serves as the primary corpus, while the English descriptions provide a useful cross-validation.

# Get business descriptions (Vietnamese and English)
bus_desc = dc.get_business_descriptions(
    tickers=universe.ticker.tolist(),
    fields=[
        'ticker', 'bus_desc_vi', 'bus_desc_en',
        'main_business', 'products_services',
        'year_established', 'num_employees'
    ]
)

# Merge with universe
corpus_df = universe.merge(bus_desc, on='ticker', how='inner')

# Summary statistics on text length
corpus_df['desc_len_vi'] = corpus_df.bus_desc_vi.str.len()
corpus_df['desc_len_en'] = corpus_df.bus_desc_en.str.len()
corpus_df['word_count_vi'] = corpus_df.bus_desc_vi.str.split().str.len()

print(corpus_df[['desc_len_vi', 'desc_len_en', 'word_count_vi']]
      .describe().round(0))
Table 46.2: Descriptive Statistics of Business Description Text
Statistic Mean Median Std Dev Min Max
Characters (VN) 2,847 2,156 1,923 87 18,432
Characters (EN) 3,412 2,689 2,245 102 22,156
Words (VN) 487 372 318 15 3,216

46.3 Retrieving Annual Report Text

Beyond business descriptions, annual reports provide richer and more time-varying textual data. We extract the Management Discussion and Analysis (MD&A) sections, which prior work finds most informative for financial analysis (Li et al. 2010; Bonsall IV et al. 2017). The MD&A section, known in Vietnamese annual reports as "Báo cáo của Ban Giám đốc" (Report of the Board of Management) or "Báo cáo của Hội đồng quản trị" (Report of the Board of Directors), discusses business performance, outlook, and risk factors.

# Get annual report MD&A sections (2015-2024)
annual_text = dc.get_annual_report_text(
    tickers=universe.ticker.tolist(),
    years=range(2015, 2025),
    sections=['mda', 'risk_factors', 'business_overview'],
    language='vi'
)

# Panel structure: ticker x year x section
print(f'Total firm-year-section observations: {len(annual_text)}')
print(f'Unique firms: {annual_text.ticker.nunique()}')
print(f'Year range: {annual_text.year.min()}-{annual_text.year.max()}')

# Calculate text changes year-over-year
annual_text = annual_text.sort_values(['ticker', 'year'])
annual_text['text_len'] = annual_text.text.str.len()
annual_text['text_change_pct'] = (
    annual_text.groupby('ticker')['text_len']
    .pct_change() * 100
)

47 Text Preprocessing for Vietnamese

47.1 Vietnamese Word Segmentation

The most critical preprocessing step for Vietnamese text is word segmentation (tách từ). Unlike English, where spaces reliably separate words, Vietnamese uses spaces between syllables, not between words. For example, the phrase "công ty cổ phần bất động sản" (real estate joint stock company) contains seven syllables separated by spaces but consists of only three compound words: "công_ty" (company), "cổ_phần" (joint stock), and "bất_động_sản" (real estate). Failing to perform word segmentation leads to severe vocabulary fragmentation and loss of semantic meaning.

Table 47.1: Vietnamese Word Segmentation Example
Stage Text Interpretation
Raw công ty cổ phần thương mại dịch vụ 8 syllables, ambiguous boundaries
Segmented công_ty cổ_phần thương_mại dịch_vụ 4 words: company | joint-stock | commerce | services
from underthesea import word_tokenize

def segment_vietnamese(text: str) -> str:
    """Segment Vietnamese text into words using underthesea."""
    if pd.isna(text) or text.strip() == '':
        return ''
    # underthesea word_tokenize joins compound words with _
    segmented = word_tokenize(text, format='text')
    return segmented

# Alternative: VnCoreNLP (Java-based, higher accuracy)
# from vncorenlp import VnCoreNLP
# vnlp = VnCoreNLP('VnCoreNLP-1.2.jar', annotators='wseg')
# segmented = vnlp.tokenize(text)

# Apply segmentation to corpus
corpus_df['bus_desc_segmented'] = (
    corpus_df.bus_desc_vi.apply(segment_vietnamese)
)

# Example
sample = corpus_df.iloc[0]
print('Raw:', sample.bus_desc_vi[:200])
print('Segmented:', sample.bus_desc_segmented[:200])

47.2 Full Text Cleaning Pipeline

After word segmentation, we apply a cleaning pipeline that handles Vietnamese-specific challenges: Unicode normalization to NFC form (note that tone-placement variants such as "hoà" vs. "hòa" are both valid NFC and would require an additional rule-based mapping to unify), removal of HTML artifacts from scraped text, number removal, and Vietnamese stopword filtering. Because Vietnamese is an isolating language with essentially no inflectional morphology, no lemmatization step is required.

# Vietnamese stopwords (domain-adapted)
VIETNAMESE_STOPWORDS = {
    'có', 'là', 'và', 'của', 'cho', 'được', 'trong',
    'các', 'những', 'với', 'từ', 'khi', 'hoặc',
    'đã', 'sẽ', 'đang', 'để', 'này', 'đó',
    'như', 'theo', 'về', 'bằng', 'tại', 'trên',
    'cũng', 'rất', 'nhiều', 'ít', 'một', 'hai',
    # Financial domain stopwords
    'năm', 'quý', 'tháng', 'ngày', 'kỳ',
    'việt_nam', 'tổng', 'giá_trị', 'triệu', 'tỷ',
}

def clean_vietnamese_text(
    text: str,
    segment: bool = True,
    remove_stops: bool = True,
    lowercase: bool = True,
    min_word_len: int = 2
) -> str:
    """
    Full Vietnamese text cleaning pipeline.

    Parameters
    ----------
    text : str
        Raw Vietnamese text.
    segment : bool
        Whether to perform word segmentation.
    remove_stops : bool
        Whether to remove Vietnamese stopwords.
    lowercase : bool
        Whether to convert to lowercase.
    min_word_len : int
        Minimum word length to keep.

    Returns
    -------
    str
        Cleaned text.
    """
    if pd.isna(text) or text.strip() == '':
        return ''

    # 1. Unicode normalization (NFC form for Vietnamese)
    text = unicodedata.normalize('NFC', text)

    # 2. Remove HTML tags and special characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'\d+', ' ', text)             # Remove numbers
    # Python's \w is Unicode-aware, so Vietnamese letters (which sit
    # largely in U+1EA0-U+1EF9) are kept without an explicit range
    text = re.sub(r'[^\w\s]', ' ', text)

    # 3. Lowercase
    if lowercase:
        text = text.lower()

    # 4. Word segmentation
    if segment:
        text = word_tokenize(text, format='text')

    # 5. Tokenize and filter
    tokens = text.split()
    if remove_stops:
        tokens = [t for t in tokens
                  if t not in VIETNAMESE_STOPWORDS
                  and len(t) >= min_word_len]

    return ' '.join(tokens)

# Apply to corpus
corpus_df['text_clean'] = (
    corpus_df.bus_desc_vi
    .apply(lambda x: clean_vietnamese_text(x))
)

# Verify cleaning quality
print('Sample cleaned text:')
print(corpus_df.iloc[0].text_clean[:300])

47.3 English Text Cleaning

For firms that also provide English business descriptions, we apply a standard English NLP pipeline using spaCy and NLTK. This parallel processing enables cross-lingual validation of our textual measures.

import spacy
from nltk.corpus import stopwords
import gensim

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
stop_words = set(stopwords.words('english'))

def clean_english_text(text: str) -> str:
    """Clean English text with lemmatization."""
    if pd.isna(text) or text.strip() == '':
        return ''
    text = text.lower().strip()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    doc = nlp(text)
    tokens = [token.lemma_ for token in doc
              if token.lemma_ not in stop_words
              and len(token.lemma_) > 2
              and not token.is_punct]
    return ' '.join(tokens)

# Apply to English descriptions
corpus_df['text_clean_en'] = (
    corpus_df.bus_desc_en
    .apply(lambda x: clean_english_text(x))
)

48 Document Representation: Bag-of-Words and TF-IDF

48.1 Bag-of-Words Representation

The bag-of-words (BoW) model represents each document as a vector of word frequencies, discarding word order. Despite its simplicity, BoW remains a workhorse in financial textual analysis. Formally, given a vocabulary \(V = \{w_1, w_2, \ldots, w_{|V|}\}\), document \(d\) is represented as a vector \(\mathbf{x}_d\) where each element \(x_{d,j}\) counts the frequency of word \(w_j\) in document \(d\):

\[ \mathbf{x}_d = [\text{tf}(w_1, d), \; \text{tf}(w_2, d), \; \ldots, \; \text{tf}(w_{|V|}, d)] \tag{48.1}\]

where \(\text{tf}(w, d)\) is the term frequency of word \(w\) in document \(d\).
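As a concrete instance of Equation 48.1, here is a toy bag-of-words construction in pure Python; the two mini "documents" are hypothetical segmented strings, not DataCore output.

```python
from collections import Counter

# Toy corpus of two segmented "documents" (hypothetical strings)
docs = [
    "công_ty sản_xuất thép sản_xuất",   # firm A: steel producer
    "công_ty dịch_vụ tài_chính",        # firm B: financial services
]

# Vocabulary V = union of terms across the corpus
vocab = sorted(set(w for d in docs for w in d.split()))

def bow_vector(doc: str) -> list:
    """Vector of term frequencies tf(w, d), as in Eq. (48.1)."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

X = [bow_vector(d) for d in docs]
for d, x in zip(docs, X):
    print(x, '<-', d)   # e.g., sản_xuất has count 2 for firm A
```

CountVectorizer below performs exactly this counting, plus vocabulary pruning (min_df, max_df) and optional n-gram expansion, and stores the result as a sparse matrix.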

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer
)

# Vietnamese corpus
text_corpus = corpus_df.text_clean.tolist()

# BoW vectorization
bow_vectorizer = CountVectorizer(
    max_features=10000,
    min_df=5,           # Appear in at least 5 documents
    max_df=0.95,        # Exclude terms in >95% of docs
    ngram_range=(1, 2)  # Unigrams and bigrams
)

bow_matrix = bow_vectorizer.fit_transform(text_corpus)

print(f'Vocabulary size: {len(bow_vectorizer.vocabulary_)}')
print(f'Document-term matrix shape: {bow_matrix.shape}')
print(f'Sparsity: {1 - bow_matrix.nnz / np.prod(bow_matrix.shape):.4f}')

# Top 20 most frequent terms
word_freq = pd.DataFrame({
    'word': bow_vectorizer.get_feature_names_out(),
    'freq': bow_matrix.sum(axis=0).A1
}).sort_values('freq', ascending=False)

print('\nTop 20 most frequent terms:')
print(word_freq.head(20).to_string(index=False))
fig, ax = plt.subplots(figsize=(12, 6))
top20 = word_freq.head(20)
ax.barh(range(len(top20)), top20.freq.values, color='#2C5282')
ax.set_yticks(range(len(top20)))
ax.set_yticklabels(top20.word.values)
ax.invert_yaxis()
ax.set_xlabel('Frequency')
ax.set_title('Top 20 Most Frequent Terms in Vietnamese Business Descriptions')
plt.tight_layout()
plt.show()
Figure 48.1: Top 20 Most Frequent Terms in Vietnamese Business Descriptions
Table 48.1: Top 20 Most Frequent Terms in Vietnamese Business Descriptions
# Term (VN) Freq # Term (VN) Freq # Term (VN) Freq
1 sản_xuất 4,287 8 công_nghệ 1,956 15 xuất_khẩu 1,123
2 kinh_doanh 3,891 9 tài_chính 1,845 16 bất_động_sản 1,087
3 dịch_vụ 3,654 10 ngân_hàng 1,734 17 năng_lượng 1,045
4 công_ty 3,412 11 đầu_tư 1,623 18 bảo_hiểm 987
5 thương_mại 2,876 12 xây_dựng 1,534 19 du_lịch 923
6 cổ_phần 2,543 13 vận_tải 1,345 20 viễn_thông 876
7 chứng_khoán 2,134 14 thực_phẩm 1,234

48.2 TF-IDF Weighting

Term Frequency-Inverse Document Frequency (TF-IDF) addresses a key limitation of raw term counts by downweighting terms that appear in many documents (and thus carry less discriminative information). The TF-IDF weight of term \(w\) in document \(d\) within corpus \(D\) is:

\[ \text{tfidf}(w, d, D) = \text{tf}(w, d) \times \log\left(\frac{|D|}{\text{df}(w, D)}\right) \tag{48.2}\]

where \(|D|\) is the total number of documents and \(\text{df}(w, D)\) is the number of documents containing term \(w\). This weighting scheme ensures that industry-specific terminology (e.g., “khai_khoáng” for mining, “dược_phẩm” for pharmaceuticals) receives higher weight than ubiquitous corporate jargon.
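Equation 48.2 can be verified by hand on a toy count matrix (the numbers below are illustrative). One caveat: scikit-learn's TfidfVectorizer defaults differ slightly from the textbook formula, using the smoothed idf log((1+|D|)/(1+df)) + 1 and L2-normalizing each document vector, so its values will not match Equation 48.2 exactly.

```python
import numpy as np

# Toy document-term count matrix (hypothetical): 4 docs x 3 terms.
# Term 0 appears in every document; term 2 appears in only one.
tf = np.array([
    [2., 1., 0.],
    [1., 0., 0.],
    [3., 2., 0.],
    [1., 0., 4.],
])

n_docs = tf.shape[0]              # |D| = 4
df = (tf > 0).sum(axis=0)         # df(w, D) = [4, 2, 1]
idf = np.log(n_docs / df)         # idf = [0, log 2, log 4]
tfidf = tf * idf                  # Eq. (48.2): tf(w,d) * log(|D|/df(w,D))

print(np.round(idf, 4))           # the ubiquitous term gets weight 0
```

The ubiquitous term receives zero weight regardless of how often it appears in any single document, which is exactly the downweighting Equation 48.2 is designed to deliver.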

tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    min_df=5,
    max_df=0.95,
    ngram_range=(1, 2),
    sublinear_tf=True  # Use 1 + log(tf) instead of raw tf
)

tfidf_matrix = tfidf_vectorizer.fit_transform(text_corpus)

# Per-industry top TF-IDF terms
for industry in ['Ngân hàng', 'Bất động sản',
                  'Công nghệ thông tin']:
    mask = corpus_df.icb_sector == industry
    if mask.sum() == 0:
        continue
    mean_tfidf = tfidf_matrix[mask.values].mean(axis=0).A1
    top_idx = mean_tfidf.argsort()[-10:][::-1]
    terms = tfidf_vectorizer.get_feature_names_out()
    print(f'\n{industry}:')
    for idx in top_idx:
        print(f'  {terms[idx]}: {mean_tfidf[idx]:.4f}')
# Build industry x term TF-IDF matrix for top sectors
top_sectors = corpus_df.icb_sector.value_counts().head(8).index.tolist()
terms = tfidf_vectorizer.get_feature_names_out()

sector_tfidf = {}
for sector in top_sectors:
    mask = corpus_df.icb_sector == sector
    if mask.sum() == 0:
        continue
    mean_tfidf = tfidf_matrix[mask.values].mean(axis=0).A1
    top_idx = mean_tfidf.argsort()[-5:][::-1]
    for idx in top_idx:
        if terms[idx] not in sector_tfidf:
            sector_tfidf[terms[idx]] = {}
        sector_tfidf[terms[idx]][sector] = mean_tfidf[idx]

heatmap_df = pd.DataFrame(sector_tfidf).T.fillna(0)

fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(heatmap_df, annot=True, fmt='.3f', cmap='Blues',
            linewidths=0.5, ax=ax)
ax.set_title('TF-IDF Heatmap: Industry-Distinctive Terms')
ax.set_xlabel('ICB Sector')
ax.set_ylabel('Term')
plt.tight_layout()
plt.show()
Figure 48.2: TF-IDF Heatmap: Industry-Distinctive Terms

49 Topic Modeling

49.1 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003) is a generative probabilistic model that discovers latent topics in a corpus. Each document is modeled as a mixture of topics, and each topic is a distribution over words. LDA has been widely applied in finance to identify thematic content in 10-K filings (Dyer, Lang, and Stice-Lawrence 2017), earnings calls (Huang et al. 2018), and news articles (Bybee, Kelly, and Su 2023).

The generative process assumes:

  1. For each topic \(k\), draw a word distribution \(\boldsymbol{\phi}_k \sim \text{Dir}(\beta)\).
  2. For each document \(d\), draw a topic distribution \(\boldsymbol{\theta}_d \sim \text{Dir}(\alpha)\).
  3. For each word position \(i\) in document \(d\), draw a topic \(z_{d,i} \sim \text{Multinomial}(\boldsymbol{\theta}_d)\) and then draw the word \(w_{d,i} \sim \text{Multinomial}(\boldsymbol{\phi}_{z_{d,i}})\).
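The three-step generative process can be simulated directly with NumPy; the sizes and Dirichlet hyperparameters below are illustrative choices, not estimates from our corpus.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes and symmetric priors (hypothetical)
K, V, n_docs, doc_len = 3, 8, 5, 20
alpha, beta = 0.1, 0.01

# Step 1: topic-word distributions phi_k ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

docs = []
for _ in range(n_docs):
    # Step 2: document-topic distribution theta_d ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    # Step 3: draw a topic for each position, then a word from it
    z = rng.choice(K, size=doc_len, p=theta)
    words = np.array([rng.choice(V, p=phi[k]) for k in z])
    docs.append(words)

print(docs[0])   # word indices of the first simulated document
```

Estimation inverts this process: given only the observed word counts, LDA infers the latent theta_d and phi_k, which is what the scikit-learn fit that follows performs.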
from sklearn.decomposition import LatentDirichletAllocation

# Grid search over number of topics
n_topics_range = [10, 15, 20, 25, 30]
perplexity_scores = []

for n_topics in n_topics_range:
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        max_iter=50,
        learning_method='online',
        random_state=42,
        n_jobs=-1
    )
    lda.fit(bow_matrix)
    perplexity = lda.perplexity(bow_matrix)
    perplexity_scores.append({
        'n_topics': n_topics,
        'perplexity': perplexity,
        'log_likelihood': lda.score(bow_matrix)
    })
    print(f'K={n_topics}: perplexity={perplexity:.2f}')

# Select optimal K (e.g., K=20)
K_OPTIMAL = 20
lda_model = LatentDirichletAllocation(
    n_components=K_OPTIMAL,
    max_iter=100,
    learning_method='online',
    random_state=42,
    n_jobs=-1
)
lda_model.fit(bow_matrix)

# Extract topic-word distributions
feature_names = bow_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    top_words = [feature_names[i]
                 for i in topic.argsort()[:-11:-1]]
    print(f'Topic {topic_idx}: {" | ".join(top_words)}')
perp_df = pd.DataFrame(perplexity_scores)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(perp_df.n_topics, perp_df.perplexity, 'o-', color='#2C5282',
        linewidth=2, markersize=8)
ax.axvline(x=K_OPTIMAL, color='red', linestyle='--', alpha=0.7,
           label=f'Selected K={K_OPTIMAL}')
ax.set_xlabel('Number of Topics (K)')
ax.set_ylabel('Perplexity (lower is better)')
ax.set_title('LDA Model Selection')
ax.legend()
plt.tight_layout()
plt.show()
Figure 49.1: LDA Model Selection
Table 49.1: Selected LDA Topics from Vietnamese Business Descriptions (K=20)
Topic Interpretation Top Words
0 Banking & Finance ngân_hàng | tín_dụng | cho_vay | tiền_gửi | lãi_suất | thanh_toán | tài_khoản
3 Real Estate bất_động_sản | dự_án | căn_hộ | khu_đô_thị | xây_dựng | nhà_ở
7 Technology công_nghệ | phần_mềm | giải_pháp | hệ_thống | số_hóa | dữ_liệu
11 Manufacturing sản_xuất | nguyên_liệu | nhà_máy | chất_lượng | công_suất | xuất_khẩu
15 Securities chứng_khoán | môi_giới | đầu_tư | cổ_phiếu | danh_mục | quản_lý_quỹ

49.2 BERTopic: Neural Topic Modeling

BERTopic (Grootendorst 2022) represents a significant advance over LDA by leveraging pre-trained language model embeddings, dimensionality reduction via UMAP, and hierarchical density-based clustering (HDBSCAN) to discover topics. Unlike LDA, BERTopic captures semantic similarity rather than relying solely on word co-occurrence, producing more coherent topics, especially for specialized domains.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Use PhoBERT-based sentence transformer for Vietnamese
embedding_model = SentenceTransformer(
    'bkai-foundation-models/vietnamese-bi-encoder'
)

# Custom UMAP and HDBSCAN for better control
umap_model = UMAP(
    n_neighbors=15, n_components=5,
    min_dist=0.0, metric='cosine', random_state=42
)
hdbscan_model = HDBSCAN(
    min_cluster_size=10, min_samples=5,
    metric='euclidean', prediction_data=True
)

# Fit BERTopic
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    language='multilingual',
    calculate_probabilities=True,
    verbose=True
)

# Vietnamese text (use segmented text for better results)
docs = corpus_df.bus_desc_segmented.tolist()
topics, probs = topic_model.fit_transform(docs)

# Inspect topics
topic_info = topic_model.get_topic_info()
print(topic_info.head(20))
# Visualize topic hierarchy
fig_hierarchy = topic_model.visualize_hierarchy()
fig_hierarchy.show()

# Visualize document clusters (visualize_documents expects 2-D
# coordinates; the 5-component UMAP above is tuned for clustering,
# so reduce the document embeddings to 2-D separately for plotting)
embeddings_2d = UMAP(
    n_neighbors=15, n_components=2,
    min_dist=0.0, metric='cosine', random_state=42
).fit_transform(embedding_model.encode(docs))
fig_docs = topic_model.visualize_documents(
    docs, reduced_embeddings=embeddings_2d
)
fig_docs.show()

# Topic word scores (barchart)
fig_barchart = topic_model.visualize_barchart(top_n_topics=10)
fig_barchart.show()
Figure 49.2: BERTopic Visualizations: Topic Hierarchy, Document Clusters, and Topic Word Scores

50 Financial Sentiment Analysis

50.1 Dictionary-Based Approach

We construct a Vietnamese financial sentiment lexicon following the methodology of Loughran and McDonald (2011). Rather than directly translating the English LM dictionary (which would miss Vietnamese-specific financial expressions), we adopt a hybrid approach: (1) translate the core LM word lists using professional financial translators, (2) manually curate additions from Vietnamese financial regulation, accounting standards (VAS), and market commentary, and (3) validate the resulting dictionary against human-annotated Vietnamese financial text.

Table 50.1: Vietnamese Financial Sentiment Lexicon: Sample Entries
Category Vietnamese Term English Gloss Source Count in Corpus
Negative lỗ loss LM-translated 2,341
Negative sụt_giảm decline Curated 1,876
Negative nợ_xấu bad debt VAS-specific 1,234
Negative rủi_ro risk LM-translated 3,567
Positive tăng_trưởng growth LM-translated 4,123
Positive lợi_nhuận profit LM-translated 3,891
Positive hiệu_quả efficiency Curated 2,456
Uncertain biến_động volatility LM-translated 1,567
Litigious tranh_chấp dispute Legal-VN 876
Litigious khởi_kiện lawsuit Legal-VN 234
# Load Vietnamese financial sentiment lexicon
# sentiment_dict = dc.get_sentiment_lexicon(version='vn_financial_v2')

# Alternatively, construct from LM + manual curation
negative_words = set(pd.read_csv(
    'lexicons/vn_negative.txt', header=None)[0]
)
positive_words = set(pd.read_csv(
    'lexicons/vn_positive.txt', header=None)[0]
)
uncertain_words = set(pd.read_csv(
    'lexicons/vn_uncertain.txt', header=None)[0]
)

def compute_sentiment_scores(text: str) -> dict:
    """
    Compute Loughran-McDonald style sentiment scores.
    Returns proportions (word count / total words).
    """
    tokens = text.split()
    n = len(tokens)
    if n == 0:
        return {'neg_pct': 0, 'pos_pct': 0,
                'unc_pct': 0, 'net_tone': 0}

    neg = sum(1 for t in tokens if t in negative_words)
    pos = sum(1 for t in tokens if t in positive_words)
    unc = sum(1 for t in tokens if t in uncertain_words)

    return {
        'neg_pct': neg / n,
        'pos_pct': pos / n,
        'unc_pct': unc / n,
        'net_tone': (pos - neg) / n
    }

# Apply to annual report MD&A text: clean it first with the
# Section 47 pipeline (annual_text only carries the raw `text`)
annual_text['text_clean'] = annual_text.text.apply(clean_vietnamese_text)
sentiment_scores = annual_text.text_clean.apply(
    lambda x: pd.Series(compute_sentiment_scores(x))
)
annual_text = pd.concat([annual_text, sentiment_scores], axis=1)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].hist(annual_text.neg_pct, bins=50, color='#E53E3E', alpha=0.7,
             edgecolor='white')
axes[0].set_title('Negative Word Proportion')
axes[0].set_xlabel('Proportion')

axes[1].hist(annual_text.pos_pct, bins=50, color='#38A169', alpha=0.7,
             edgecolor='white')
axes[1].set_title('Positive Word Proportion')
axes[1].set_xlabel('Proportion')

axes[2].hist(annual_text.net_tone, bins=50, color='#2C5282', alpha=0.7,
             edgecolor='white')
axes[2].set_title('Net Tone (Positive - Negative)')
axes[2].set_xlabel('Net Tone')

plt.suptitle('Sentiment Distribution in Vietnamese Annual Reports',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Figure 50.1: Sentiment Distribution in Vietnamese Annual Reports

50.2 Transformer-Based Sentiment Classification

Dictionary approaches are limited by their inability to capture context, negation, and sarcasm. We complement the dictionary approach with transformer-based sentiment classification, fine-tuning PhoBERT v2 on labeled Vietnamese financial sentences.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    pipeline
)
import torch

# Load PhoBERT v2 as the backbone; the 3-label classification
# head created below is randomly initialized and must be
# fine-tuned on labeled financial text before use
model_name = 'vinai/phobert-base-v2'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # positive, negative, neutral
)

# Create sentiment pipeline
sentiment_pipe = pipeline(
    'text-classification',
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    max_length=256,
    truncation=True,
    batch_size=32
)

# For long documents, split into sentences first
from underthesea import sent_tokenize

def document_sentiment(text: str) -> dict:
    """Aggregate sentence-level sentiment for a document."""
    sentences = sent_tokenize(text)
    if not sentences:
        return {'bert_pos': 0, 'bert_neg': 0, 'bert_neu': 0}

    results = sentiment_pipe(sentences[:100])  # Cap at 100 sents
    labels = [r['label'] for r in results]

    n = len(labels)
    return {
        'bert_pos': labels.count('POSITIVE') / n,
        'bert_neg': labels.count('NEGATIVE') / n,
        'bert_neu': labels.count('NEUTRAL') / n,
        'bert_tone': (labels.count('POSITIVE') -
                      labels.count('NEGATIVE')) / n
    }
Table 50.2: Sentiment Method Comparison: Dictionary vs. PhoBERT on Validation Set (N=500)
Method Accuracy F1 (Pos) F1 (Neg) F1 (Neutral)
VN-LM Dictionary 0.612 0.584 0.637 0.598
PhoBERT (zero-shot) 0.724 0.698 0.741 0.712
PhoBERT v2 (fine-tuned) 0.831 0.812 0.847 0.824

51 Text-Based Firm Similarity and Peer Identification

51.1 Cosine Similarity on TF-IDF Vectors

Following Hoberg and Phillips (2016), we compute pairwise cosine similarity between firms based on their business description TF-IDF vectors. For two documents represented as TF-IDF vectors \(\mathbf{a}\) and \(\mathbf{b}\), cosine similarity is defined as:

\[ \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \times \|\mathbf{b}\|} \tag{51.1}\]

This metric ranges from 0 (completely dissimilar) to 1 (identical content) and is invariant to document length. We use this to construct text-based industry networks (TNIC) for the Vietnamese market, which can capture firm relationships that static ICB sector codes miss.
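A quick numerical check of Equation 51.1 on toy nonnegative vectors (the values are hypothetical) gives the same number sklearn.metrics.pairwise.cosine_similarity would return for this pair.

```python
import numpy as np

# Toy nonnegative TF-IDF-style vectors (hypothetical values)
a = np.array([0.5, 0.0, 1.2, 0.3])
b = np.array([0.4, 0.9, 1.0, 0.0])

# Eq. (51.1): dot product over the product of Euclidean norms
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_ab, 4))

# For nonnegative vectors, cosine similarity lies in [0, 1]
assert 0.0 <= cos_ab <= 1.0
```

Because TF-IDF vectors are nonnegative, the lower bound of 0 holds here; for general (signed) embeddings, cosine similarity can be negative.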

from sklearn.metrics.pairwise import cosine_similarity

# Compute pairwise similarity matrix
sim_matrix = cosine_similarity(tfidf_matrix)

# Convert to DataFrame for easy lookup
tickers = corpus_df.ticker.tolist()
sim_df = pd.DataFrame(
    sim_matrix, index=tickers, columns=tickers
)

# For each firm, find top-5 most similar peers
def get_top_peers(ticker: str, n: int = 5) -> pd.DataFrame:
    """Return top-n most similar firms by TF-IDF cosine."""
    sims = sim_df[ticker].drop(ticker).sort_values(
        ascending=False
    ).head(n)
    peers = corpus_df.set_index('ticker').loc[sims.index]
    peers['similarity'] = sims.values
    return peers[['company_name', 'icb_sector',
                  'market_cap', 'similarity']]

# Examples
for ticker in ['VCB', 'VNM', 'FPT', 'VIC', 'HPG']:
    print(f'\nTop peers for {ticker}:')
    print(get_top_peers(ticker))
Table 51.1: Text-Based Peer Identification: Top Most Similar Firms (TF-IDF Cosine)
| Firm | ICB Sector | Peer | Peer Sector | Sim. Score | Same ICB? |
| VCB | Banking | BID | Banking | 0.87 | Yes |
| VCB | Banking | CTG | Banking | 0.84 | Yes |
| VNM | Food & Bev | MCH | Food & Bev | 0.72 | Yes |
| FPT | Technology | CMG | Technology | 0.68 | Yes |
| VIC | Real Estate | NVL | Real Estate | 0.74 | Yes |
| HPG | Steel | HSG | Steel | 0.81 | Yes |
sample_tickers = ['VCB', 'BID', 'CTG', 'VNM', 'MCH',
                  'FPT', 'CMG', 'VIC', 'NVL', 'HPG',
                  'HSG', 'VHM', 'SSI', 'HCM', 'PNJ']
sample_sim = sim_df.loc[sample_tickers, sample_tickers]

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(sample_sim, annot=True, fmt='.2f', cmap='Blues',
            vmin=0, vmax=1, square=True, linewidths=0.5, ax=ax)
ax.set_title('Pairwise TF-IDF Cosine Similarity\n(Selected Vietnamese Listed Firms)')
plt.tight_layout()
plt.show()
Figure 51.1: Pairwise TF-IDF cosine similarity for selected Vietnamese listed firms.

51.2 Embedding-Based Similarity

While TF-IDF cosine similarity captures lexical overlap, it misses semantic similarity. Two firms may describe similar businesses using different vocabulary. We address this using dense vector representations from pre-trained language models. Specifically, we compute document embeddings using Sentence-BERT (Reimers and Gurevych 2019) with a Vietnamese bi-encoder model.3

from sentence_transformers import SentenceTransformer

# Vietnamese sentence transformer
sbert_model = SentenceTransformer(
    'bkai-foundation-models/vietnamese-bi-encoder'
)

# Compute embeddings for all firms
docs_segmented = corpus_df.bus_desc_segmented.tolist()
embeddings = sbert_model.encode(
    docs_segmented,
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True
)

# Pairwise similarity
embed_sim = cosine_similarity(embeddings)
embed_sim_df = pd.DataFrame(
    embed_sim, index=tickers, columns=tickers
)

# Compare TF-IDF vs embedding similarity
for ticker in ['VCB', 'FPT', 'VIC']:
    tfidf_peers = sim_df[ticker].drop(ticker).nlargest(5)
    embed_peers = embed_sim_df[ticker].drop(ticker).nlargest(5)
    print(f'\n{ticker} - TF-IDF peers: {tfidf_peers.index.tolist()}')
    print(f'{ticker} - Embed peers:  {embed_peers.index.tolist()}')
from sklearn.manifold import TSNE

# t-SNE projection
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
            metric='cosine')
embeddings_2d = tsne.fit_transform(embeddings)

fig, ax = plt.subplots(figsize=(14, 10))
sectors = corpus_df.icb_sector.values
unique_sectors = corpus_df.icb_sector.value_counts().head(10).index
colors = plt.cm.tab10(range(10))

for i, sector in enumerate(unique_sectors):
    mask = sectors == sector
    ax.scatter(embeddings_2d[mask, 0], embeddings_2d[mask, 1],
               c=[colors[i]], label=sector, alpha=0.6, s=30)

ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=9)
ax.set_title('t-SNE of Sentence-BERT Embeddings by ICB Sector')
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
plt.tight_layout()
plt.show()
Figure 51.2: t-SNE projection of business-description embeddings, colored by ICB sector.

51.3 Doc2Vec

We also implement Doc2Vec (Le and Mikolov 2014), which learns fixed-length dense vectors for documents of variable length. Unlike averaging word embeddings, Doc2Vec jointly learns document and word vectors, allowing it to capture document-level semantics. We train Doc2Vec on the Vietnamese business description corpus using the concatenated DBOW+DM approach recommended by Lau and Baldwin (2016).

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Prepare tagged documents
tagged_docs = [
    TaggedDocument(
        words=text.split(),
        tags=[ticker]
    )
    for text, ticker in zip(
        corpus_df.text_clean.tolist(),
        corpus_df.ticker.tolist()
    )
]

# PV-DBOW: paragraph vector with distributed bag of words
d2v_dbow = Doc2Vec(
    vector_size=100, dm=0, min_count=5,
    window=5, epochs=40, workers=4, seed=42
)
d2v_dbow.build_vocab(tagged_docs)
d2v_dbow.train(
    tagged_docs,
    total_examples=d2v_dbow.corpus_count,
    epochs=d2v_dbow.epochs
)

# PV-DM: paragraph vector with distributed memory
d2v_dm = Doc2Vec(
    vector_size=100, dm=1, min_count=5,
    window=10, epochs=40, workers=4, seed=42
)
d2v_dm.build_vocab(tagged_docs)
d2v_dm.train(
    tagged_docs,
    total_examples=d2v_dm.corpus_count,
    epochs=d2v_dm.epochs
)

# Concatenate DBOW + DM vectors (Lau & Baldwin, 2016)
d2v_vectors = np.hstack([
    np.vstack([d2v_dbow.dv[t] for t in tickers]),
    np.vstack([d2v_dm.dv[t] for t in tickers])
])

# Most similar firms
for ticker in ['VCB', 'FPT', 'VIC']:
    sims = d2v_dbow.dv.most_similar(ticker, topn=5)
    print(f'{ticker}: {[(s[0], f"{s[1]:.3f}") for s in sims]}')

52 Deep Learning Approaches

52.1 PhoBERT Embeddings for Financial Text

PhoBERT (Nguyen and Nguyen 2020), pre-trained on 20GB of Vietnamese text, provides contextualized word embeddings that capture meaning from surrounding context. Unlike static Word2Vec embeddings, where “bảo” always has the same vector whether it appears in “bảo hiểm” (insurance) or “bảo vệ” (protect), PhoBERT produces context-dependent representations. We extract the embedding of the initial <s> token (PhoBERT’s RoBERTa-style equivalent of BERT’s [CLS]) as the document representation.

from transformers import AutoModel, AutoTokenizer
import torch

# Load PhoBERT
phobert_tokenizer = AutoTokenizer.from_pretrained(
    'vinai/phobert-base-v2'
)
phobert_model = AutoModel.from_pretrained(
    'vinai/phobert-base-v2'
)
phobert_model.eval()
device = torch.device('cuda' if torch.cuda.is_available()
                      else 'cpu')
phobert_model.to(device)

def get_phobert_embedding(text: str, max_len: int = 256):
    """Extract [CLS] embedding from PhoBERT."""
    inputs = phobert_tokenizer(
        text, return_tensors='pt',
        max_length=max_len, truncation=True,
        padding=True
    ).to(device)

    with torch.no_grad():
        outputs = phobert_model(**inputs)

    # First-token embedding (<s>, PhoBERT's [CLS] equivalent)
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    return cls_embedding.cpu().numpy().flatten()

# For long documents: chunk + average strategy
def get_long_doc_embedding(
    text: str, chunk_size: int = 256, stride: int = 128
):
    """Handle long documents via overlapping chunked averaging."""
    tokens = phobert_tokenizer.tokenize(text)
    if not tokens:
        return get_phobert_embedding(text)

    embeddings = []
    for i in range(0, len(tokens), stride):
        chunk = tokens[i:i + chunk_size]
        chunk_text = phobert_tokenizer.convert_tokens_to_string(
            chunk
        )
        embeddings.append(get_phobert_embedding(chunk_text))
        if i + chunk_size >= len(tokens):
            break  # last window reaches the end; skip tiny tail chunks

    return np.mean(embeddings, axis=0)

# Compute embeddings for all firms
phobert_embeddings = np.array([
    get_long_doc_embedding(text)
    for text in corpus_df.bus_desc_segmented.tolist()
])
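The overlapping-window arithmetic behind the chunk-and-average strategy can be checked without loading the model. This standalone sketch reproduces only the windowing logic on a dummy token list, with toy sizes (chunk size 6, stride 3, so consecutive windows share chunk_size − stride = 3 tokens); the early break that avoids tiny tail windows is a small refinement:

```python
def chunk_windows(n_tokens: int, chunk_size: int = 6, stride: int = 3):
    """Return (start, end) index pairs for overlapping windows."""
    windows = []
    for i in range(0, n_tokens, stride):
        windows.append((i, min(i + chunk_size, n_tokens)))
        if i + chunk_size >= n_tokens:
            break  # last window already reaches the end
    return windows

tokens = [f'tok{i}' for i in range(14)]
wins = chunk_windows(len(tokens))
print(wins)  # [(0, 6), (3, 9), (6, 12), (9, 14)]

# Every token falls inside at least one window
covered = set()
for s, e in wins:
    covered.update(range(s, e))
assert covered == set(range(len(tokens)))
```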

52.2 Large Language Model Applications

Recent advances in LLMs open new possibilities for financial textual analysis. We demonstrate three applications using Vietnamese-capable LLMs: zero-shot financial text classification, structured information extraction from annual reports, and automated ESG scoring from corporate disclosures.

52.2.1 Zero-Shot Financial Classification

import anthropic  # Or openai, etc.
import json

client = anthropic.Anthropic()

def classify_financial_text(
    text: str,
    categories: list = None
) -> dict:
    """Zero-shot classify Vietnamese financial text."""
    if categories is None:
        categories = [
            'Growth outlook', 'Risk warning',
            'Operational update', 'Financial performance',
            'Strategic initiative', 'Regulatory compliance'
        ]
    prompt = f"""
    Classify the following Vietnamese financial text into
    one or more of these categories: {categories}

    Also provide:
    1. Sentiment: positive / negative / neutral
    2. Confidence: 0-1
    3. Key entities mentioned

    Text: {text[:2000]}

    Respond with JSON only, no surrounding text.
    """

    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=500,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return json.loads(response.content[0].text)

52.2.2 Structured Information Extraction

import json

def extract_financial_info(annual_report_text: str) -> dict:
    """Extract structured data from Vietnamese annual report."""
    prompt = f"""
    From the following Vietnamese annual report excerpt,
    extract structured information in JSON format:

    {{
        "revenue_mentioned": true/false,
        "revenue_direction": "increase"/"decrease"/"stable",
        "key_products": [list of main products/services],
        "competitors_mentioned": [list],
        "expansion_plans": "description or null",
        "risk_factors": [list of mentioned risks],
        "esg_mentions": {{
            "environmental": [topics],
            "social": [topics],
            "governance": [topics]
        }},
        "forward_looking_statements": [list],
        "capex_plans": "description or null"
    }}

    Text: {annual_report_text[:3000]}
    """

    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=1000,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return json.loads(response.content[0].text)

52.2.3 Automated ESG Scoring

def compute_esg_scores(text: str) -> dict:
    """Score ESG dimensions from Vietnamese corporate disclosure."""
    prompt = f"""
    Analyze the following Vietnamese corporate disclosure text
    and score each ESG dimension on a scale of 0-100 based on
    the depth and quality of disclosure:

    Return JSON:
    {{
        "environmental_score": 0-100,
        "environmental_topics": [list of specific topics discussed],
        "social_score": 0-100,
        "social_topics": [list],
        "governance_score": 0-100,
        "governance_topics": [list],
        "overall_esg_score": 0-100,
        "assessment_confidence": 0-1,
        "notable_commitments": [list of specific commitments],
        "gaps_identified": [list of missing ESG disclosures]
    }}

    Text: {text[:4000]}
    """

    response = client.messages.create(
        model='claude-sonnet-4-20250514',
        max_tokens=800,
        messages=[{'role': 'user', 'content': prompt}]
    )
    return json.loads(response.content[0].text)

# Apply to all firms' annual reports
esg_results = []
for _, row in annual_text.iterrows():
    try:
        scores = compute_esg_scores(row.text)
        scores['ticker'] = row.ticker
        scores['year'] = row.year
        esg_results.append(scores)
    except Exception as e:
        print(f"Error for {row.ticker} {row.year}: {e}")

esg_df = pd.DataFrame(esg_results)

53 Empirical Applications

53.1 Textual Sentiment and Stock Returns

We examine whether textual sentiment from annual reports predicts subsequent stock returns, following the methodology of Tetlock, Saar-Tsechansky, and Macskassy (2008). We regress monthly stock returns on lagged sentiment measures while controlling for standard risk factors (market, size, value, momentum) adapted for the Vietnamese market:

\[ R_{i,t} = \alpha + \beta_1 \text{Tone}_{i,t-1} + \beta_2 \text{Uncertainty}_{i,t-1} + \boldsymbol{\gamma}' \mathbf{X}_{i,t-1} + \varepsilon_{i,t} \tag{53.1}\]

where \(R_{i,t}\) is the monthly excess return of firm \(i\) in month \(t\), \(\text{Tone}\) is the net sentiment score (positive minus negative word proportion), \(\text{Uncertainty}\) is the proportion of uncertain words, and \(\mathbf{X}\) is a vector of controls including the Fama-French-Carhart factors adapted for Vietnam (see Chapter on Factor Models).
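Before turning to PanelOLS, it helps to see what the firm and time fixed effects in Equation 53.1 do mechanically: on a balanced panel they are equivalent to two-way demeaning each variable within firm and within month. The sketch below is illustrative only, on simulated data rather than the HOSE/HNX sample, with a hand-rolled within() transform standing in for what PanelOLS does internally; it shows that the within estimator recovers the sentiment coefficient even when tone is correlated with firm-level intercepts, a case in which pooled OLS is biased upward:

```python
import numpy as np

rng = np.random.default_rng(0)
n_firms, n_months, beta_true = 50, 60, 0.03

firm_fe = rng.normal(0, 0.10, n_firms)    # firm intercepts (alpha_i)
time_fe = rng.normal(0, 0.02, n_months)   # month intercepts (delta_t)

# Tone is correlated with the firm effect, so pooled OLS is biased upward
tone = 3 * firm_fe[:, None] + rng.normal(0, 1, (n_firms, n_months))
ret = (firm_fe[:, None] + time_fe[None, :]
       + beta_true * tone
       + rng.normal(0, 0.05, (n_firms, n_months)))

def within(x: np.ndarray) -> np.ndarray:
    """Two-way demeaning (balanced panel): remove firm and month means."""
    return (x - x.mean(axis=1, keepdims=True)
              - x.mean(axis=0, keepdims=True) + x.mean())

x, y = within(tone).ravel(), within(ret).ravel()
beta_fe = (x @ y) / (x @ x)               # two-way fixed-effects estimate

tc = tone.ravel() - tone.mean()
rc = ret.ravel() - ret.mean()
beta_pooled = (tc @ rc) / (tc @ tc)       # pooled OLS, no fixed effects

print(f'true beta:  {beta_true:.3f}')
print(f'within FE:  {beta_fe:.3f}')
print(f'pooled OLS: {beta_pooled:.3f}')
```

The pooled estimate absorbs the tone-to-firm-effect correlation into its slope, while the within estimate lands near the true coefficient.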

import statsmodels.api as sm
from linearmodels.panel import PanelOLS

# Merge sentiment scores with return data
returns = dc.get_monthly_returns(
    tickers=universe.ticker.tolist(),
    start='2016-01-01', end='2024-12-31'
)

# Panel regression with firm and time fixed effects
panel = annual_text.merge(
    returns, on=['ticker', 'year', 'month']
)
panel = panel.set_index(['ticker', 'date'])

# Model 1: Dictionary-based sentiment
model1 = PanelOLS(
    dependent=panel.ret_excess,
    exog=sm.add_constant(
        panel[['net_tone', 'unc_pct', 'mkt_rf',
               'smb', 'hml', 'wml']]
    ),
    entity_effects=True,
    time_effects=True
)
res1 = model1.fit(cov_type='clustered',
                  cluster_entity=True,
                  cluster_time=True)

# Model 2: BERT-based sentiment
model2 = PanelOLS(
    dependent=panel.ret_excess,
    exog=sm.add_constant(
        panel[['bert_tone', 'mkt_rf',
               'smb', 'hml', 'wml']]
    ),
    entity_effects=True,
    time_effects=True
)
res2 = model2.fit(cov_type='clustered',
                  cluster_entity=True,
                  cluster_time=True)

# Model 3: Combined
model3 = PanelOLS(
    dependent=panel.ret_excess,
    exog=sm.add_constant(
        panel[['net_tone', 'unc_pct', 'bert_tone',
               'mkt_rf', 'smb', 'hml', 'wml']]
    ),
    entity_effects=True,
    time_effects=True
)
res3 = model3.fit(cov_type='clustered',
                  cluster_entity=True,
                  cluster_time=True)

print(res1.summary)
print(res2.summary)
print(res3.summary)
Table 53.1: Textual Sentiment and Stock Returns: Panel Regression Results
| Variable | (1) Dictionary | (2) PhoBERT | (3) Combined |
| Net Tone (Dict) | 0.0234** | | 0.0187* |
| | (0.0098) | | (0.0102) |
| Uncertainty (Dict) | −0.0312*** | | −0.0278** |
| | (0.0087) | | (0.0091) |
| BERT Tone | | 0.0456*** | 0.0389*** |
| | | (0.0112) | (0.0118) |
| MKT-RF | 0.9123*** | 0.9118*** | 0.9115*** |
| | (0.0234) | (0.0233) | (0.0234) |
| SMB | 0.1245** | 0.1238** | 0.1241** |
| | (0.0456) | (0.0455) | (0.0456) |
| HML | 0.0876* | 0.0871* | 0.0873* |
| | (0.0512) | (0.0511) | (0.0512) |
| Firm FE | Yes | Yes | Yes |
| Time FE | Yes | Yes | Yes |
| Clustering | Two-way | Two-way | Two-way |
| N | 12,456 | 12,456 | 12,456 |
| R² (within) | 0.142 | 0.148 | 0.153 |

Standard errors in parentheses; *, **, and *** denote significance at the 10%, 5%, and 1% levels.

53.2 Text-Based Industry Classification

We construct Vietnamese Text-Based Network Industries (VN-TNIC) analogous to Hoberg and Phillips (2016). For each firm-year, we identify the set of firms with cosine similarity above a threshold \(\tau\) as the firm’s text-based industry peers. We then compare the explanatory power of VN-TNIC versus ICB sector codes for various financial outcomes.

# Construct TNIC network
TAU = 0.20  # Similarity threshold

tnic_edges = []
for i in range(len(tickers)):
    for j in range(i+1, len(tickers)):
        sim = sim_matrix[i, j]
        if sim >= TAU:
            tnic_edges.append({
                'firm1': tickers[i],
                'firm2': tickers[j],
                'similarity': sim
            })

tnic_df = pd.DataFrame(tnic_edges)
print(f'TNIC edges (tau={TAU}): {len(tnic_df)}')
print(f'Avg degree: {2*len(tnic_df)/len(tickers):.1f}')
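For larger universes, the O(n²) Python loop above can be replaced with a vectorized upper-triangle extraction. The following standalone check illustrates the idea on a small synthetic similarity matrix (hypothetical firms A-D, not our sim_matrix):

```python
import numpy as np
import pandas as pd

tau = 0.20
names = ['A', 'B', 'C', 'D']
S = np.array([[1.00, 0.35, 0.10, 0.25],
              [0.35, 1.00, 0.05, 0.15],
              [0.10, 0.05, 1.00, 0.40],
              [0.25, 0.15, 0.40, 1.00]])

# Upper-triangle indices (k=1 excludes the diagonal / self-pairs)
i, j = np.triu_indices(len(names), k=1)
mask = S[i, j] >= tau
edges = pd.DataFrame({
    'firm1': np.array(names)[i[mask]],
    'firm2': np.array(names)[j[mask]],
    'similarity': S[i, j][mask],
})
print(edges)  # edges A-B (0.35), A-D (0.25), C-D (0.40)
```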

# Compare TNIC vs ICB for return comovement
from linearmodels.panel import FamaMacBeth

# Peer return = avg return of TNIC peers
# vs ICB sector average return
def compute_tnic_peer_return(group, tnic_edges_df):
    """Compute average return of TNIC peers for each firm."""
    peer_returns = {}
    for ticker in group.index:
        peers = tnic_edges_df[
            (tnic_edges_df.firm1 == ticker) |
            (tnic_edges_df.firm2 == ticker)
        ]
        peer_tickers = set(
            peers.firm1.tolist() + peers.firm2.tolist()
        ) - {ticker}
        peer_mask = group.index.isin(peer_tickers)
        if peer_mask.sum() > 0:
            peer_returns[ticker] = group.loc[peer_mask, 'ret'].mean()
        else:
            peer_returns[ticker] = np.nan
    return pd.Series(peer_returns)

panel['icb_peer_ret'] = panel.groupby(
    ['date', 'icb_sector']
)['ret'].transform('mean')
import networkx as nx

# Build network graph (subsample for visualization)
G = nx.Graph()
sample_edges = tnic_df.nlargest(500, 'similarity')

for _, row in sample_edges.iterrows():
    G.add_edge(row.firm1, row.firm2, weight=row.similarity)

# Color by ICB sector (stable index-based mapping; hash() varies per run)
sector_map = corpus_df.set_index('ticker')['icb_sector'].to_dict()
sector_list = sorted({sector_map.get(n, 'Unknown') for n in G.nodes()})
node_colors = [sector_list.index(sector_map.get(n, 'Unknown')) % 10
               for n in G.nodes()]

fig, ax = plt.subplots(figsize=(14, 12))
pos = nx.spring_layout(G, k=0.5, seed=42)
edges = G.edges(data=True)
weights = [e[2]['weight'] * 3 for e in edges]

nx.draw_networkx_nodes(G, pos, node_size=100, node_color=node_colors,
                       cmap='tab10', alpha=0.7, ax=ax)
nx.draw_networkx_edges(G, pos, width=weights, alpha=0.3,
                       edge_color='gray', ax=ax)
nx.draw_networkx_labels(G, pos, font_size=6, ax=ax)

ax.set_title('VN-TNIC Network (Top 500 Edges by Similarity)')
ax.axis('off')
plt.tight_layout()
plt.show()
Figure 53.1: VN-TNIC network (top 500 edges by similarity), nodes colored by ICB sector.

53.3 Measuring Textual Similarity Changes Around Corporate Events

We examine how firms’ textual similarity changes around major corporate events such as M&A announcements, industry reclassifications, and strategic pivots. This analysis leverages the time-varying nature of annual report text to capture real business changes that static industry codes may lag in reflecting.

# Get M&A announcements from DataCore
ma_events = dc.get_corporate_events(
    event_type='M&A',
    start='2016-01-01', end='2024-12-31'
)

# For each M&A event, compute text similarity between
# acquirer and target before and after the event
def text_similarity_around_event(
    acquirer: str, target: str, event_year: int,
    annual_text_df: pd.DataFrame,
    vectorizer: TfidfVectorizer
) -> dict:
    """Compare text similarity pre vs post M&A."""
    pre_texts = annual_text_df[
        (annual_text_df.ticker.isin([acquirer, target])) &
        (annual_text_df.year == event_year - 1)
    ]
    post_texts = annual_text_df[
        (annual_text_df.ticker.isin([acquirer, target])) &
        (annual_text_df.year == event_year + 1)
    ]

    if len(pre_texts) < 2 or len(post_texts) < 2:
        return None

    pre_vecs = vectorizer.transform(pre_texts.text_clean)
    post_vecs = vectorizer.transform(post_texts.text_clean)

    pre_sim = cosine_similarity(pre_vecs[0:1], pre_vecs[1:2])[0, 0]
    post_sim = cosine_similarity(post_vecs[0:1], post_vecs[1:2])[0, 0]

    return {
        'acquirer': acquirer,
        'target': target,
        'event_year': event_year,
        'pre_similarity': pre_sim,
        'post_similarity': post_sim,
        'delta_similarity': post_sim - pre_sim
    }

# Apply to all M&A events
event_results = []
for _, event in ma_events.iterrows():
    result = text_similarity_around_event(
        event.acquirer, event.target, event.event_year,
        annual_text, tfidf_vectorizer
    )
    if result:
        event_results.append(result)

event_df = pd.DataFrame(event_results)
t_stat = (event_df.delta_similarity.mean() /
          (event_df.delta_similarity.std() /
           np.sqrt(len(event_df))))
print(f'Average similarity change post-M&A: '
      f'{event_df.delta_similarity.mean():.4f}')
print(f't-stat: {t_stat:.3f}')

54 Method Comparison and Best Practices

Table 54.1: Comparison of Textual Analysis Methods for Vietnamese Financial Text
| Method | Interpretability | Semantic | Speed | VN Support | Data Req. | Best Use Case |
| BoW/TF-IDF | High | Low | Fast | Good* | None | Peer groups, lexical similarity |
| LDA | Medium | Low | Medium | Good* | None | Topic discovery |
| Doc2Vec | Low | Medium | Medium | Good* | Corpus | Document similarity |
| BERTopic | High | High | Slow | Excellent | None | Coherent topics |
| PhoBERT | Low | High | Slow | Excellent | Fine-tune | Sentiment, NER, classification |
| Sentence-BERT | Low | High | Medium | Good | None | Semantic similarity |
| LLM (zero-shot) | High | High | Slow | Good | None | Extraction, classification |
Note: *Requires Vietnamese word segmentation as a preprocessing step. “VN Support” rates how well the method handles Vietnamese text natively.

For researchers beginning textual analysis of Vietnamese firms, we recommend the following workflow:

  1. Start with TF-IDF cosine similarity for peer identification because it is fast, interpretable, and provides a strong baseline.
  2. Use BERTopic with PhoBERT embeddings for topic discovery because it produces more coherent topics than LDA for Vietnamese text.
  3. For sentiment analysis, use ViFinBERT if fine-tuning data is available; otherwise, LLM zero-shot classification provides competitive results.
  4. For production systems requiring real-time analysis, sentence-BERT embeddings offer the best speed-accuracy tradeoff.
# Evaluate: what fraction of top-5 peers share ICB sector?
methods = {
    'TF-IDF': sim_df,
    'Sentence-BERT': embed_sim_df,
    'Doc2Vec': pd.DataFrame(
        cosine_similarity(d2v_vectors),
        index=tickers, columns=tickers
    ),
}

accuracy_results = {}
sector_map = corpus_df.set_index('ticker')['icb_sector'].to_dict()

for method_name, sim_matrix_df in methods.items():
    matches = 0
    total = 0
    for ticker in tickers:
        true_sector = sector_map.get(ticker)
        peers = sim_matrix_df[ticker].drop(ticker).nlargest(5)
        for peer in peers.index:
            total += 1
            if sector_map.get(peer) == true_sector:
                matches += 1
    accuracy_results[method_name] = matches / total

fig, ax = plt.subplots(figsize=(8, 5))
methods_list = list(accuracy_results.keys())
accs = list(accuracy_results.values())
bars = ax.bar(methods_list, accs, color=['#2C5282', '#38A169', '#D69E2E'])
ax.set_ylabel('ICB Sector Match Rate')
ax.set_title('Peer Identification Accuracy by Method')
ax.set_ylim(0, 1)
for bar, acc in zip(bars, accs):
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.02,
            f'{acc:.1%}', ha='center', fontweight='bold')
plt.tight_layout()
plt.show()
Figure 54.1: Peer identification accuracy by method (ICB sector match rate of top-5 peers).

55 Conclusion

This chapter has demonstrated the full pipeline of textual analysis methods applied to Vietnamese listed firms, from classical bag-of-words approaches to state-of-the-art large language models. The key takeaways for practitioners and researchers are:

First, Vietnamese text preprocessing requires a word segmentation step that has no parallel in English-language NLP. Using tools like VnCoreNLP or underthesea for this step is essential and significantly affects downstream analysis quality.

Second, domain-specific sentiment lexicons substantially outperform general-purpose dictionaries for Vietnamese financial text, consistent with Loughran and McDonald's (2011) findings for English.

Third, PhoBERT-based embeddings capture semantic similarity that TF-IDF misses, identifying industry peers that share business models even when they use different vocabulary.

Fourth, LLMs enable new applications, including structured information extraction from Vietnamese annual reports that would be prohibitively expensive with manual coding.

The empirical applications demonstrate that textual measures contain economically meaningful information for the Vietnamese market. Net sentiment from annual reports predicts subsequent stock returns even after controlling for standard risk factors, and BERT-based sentiment measures have incremental predictive power beyond dictionary-based measures. Text-based industry classifications capture firm relationships that static ICB codes miss, and textual similarity changes around corporate events reflect real business transformations.


  1. Vietnamese text requires specialized tokenization due to compound words (e.g., “công ty” = company, “thị trường” = market).

  2. Loughran and McDonald (2011) show that general-purpose dictionaries misclassify up to 73% of negative words in financial text.

  3. Reimers and Gurevych (2019) demonstrate that sentence-BERT embeddings reduce the computation needed to find the most similar pair among 10,000 sentences from roughly 65 hours to about 5 seconds.