TextSumm
Large-Scale NLP Summarization Pipeline
“Can hybrid extractive–abstractive pipelines outperform pure transformer baselines at scale?”
3×
ROUGE-2 Improvement
61.32
ROUGE-1 F1
564K docs
Corpus Size
99.9%
Uptime
01. The Problem
Abstractive summarization models (BART, T5) produce fluent summaries but often hallucinate — they introduce facts not present in the source document. On a 564K-document corpus of professional documents (reports, articles, legal text), hallucination rates with pure abstractive models were unacceptably high for business use.
02. Why It's Hard
The core tension is fluency vs. faithfulness. Extractive models (TF-IDF, KMeans) are faithful but produce stilted, disconnected output. Abstractive transformers are fluent but unfaithful at scale. Combining them requires a principled pipeline that uses the extractive stage to constrain the abstractive model's input — without destroying the fluency benefit.
03. Our Approach: Hybrid Extractive–Abstractive Pipeline
A two-stage pipeline: (1) TF-IDF + KMeans clustering selects the most semantically representative sentences from the source document, reducing input length by ~70% while preserving key content; (2) a fine-tuned BART model generates a fluent abstractive summary from the extracted sentences. This constrains the abstractive model to source-grounded content, eliminating the primary hallucination source while maintaining ROUGE improvements over extractive-only baselines.
Architecture — Two-stage pipeline: TF-IDF/KMeans extraction → fine-tuned BART abstraction → FastAPI multi-format API.
- 1. Document ingested (PDF, DOCX, URL, raw text) via FastAPI multi-format parser
- 2. TF-IDF vectorization + KMeans clustering identifies representative sentences (~70% length reduction)
- 3. Fine-tuned BART generates abstractive summary from extracted sentences only
- 4. Summary returned via API with confidence score and extractive highlights
- 5. Docker container deployed on Azure ACI with GitHub Actions CI/CD pipeline
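The extractive stage (step 2) can be sketched as follows — a minimal version assuming scikit-learn, where each KMeans cluster contributes the sentence nearest its centroid. The function name, naive sentence handling, and default `k` are illustrative, not the project's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def extract_representative(sentences, k=3):
    """Pick one sentence per KMeans cluster (the one nearest the
    cluster centroid), returned in original document order."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # (n_sents, vocab)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf)
    picked = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # distance of each member sentence to its cluster centroid
        dists = np.linalg.norm(tfidf[idx].toarray() - km.cluster_centers_[c], axis=1)
        picked.append(idx[np.argmin(dists)])
    return [sentences[i] for i in sorted(picked)]
```

The sorted indices preserve document order, which matters because the extracted sentences are fed to BART as a contiguous input rather than an unordered set.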
04. Key Results
- ▹ Achieved 3× ROUGE-2 improvement over pure abstractive baselines (ROUGE-1: 61.32) on 564K+ document corpus
- ▹ Novel hybrid architecture: fine-tuned transformer (abstractive) + TF-IDF/KMeans (extractive)
- ▹ Multi-format FastAPI: PDF, DOCX, URL, raw text — reducing user processing time by 40%
- ▹ Deployed on Azure ACI via Docker with automated CI/CD (GitHub Actions), 99.9% uptime
| Method | ROUGE-2 |
|---|---|
| Pure abstractive (BART baseline) | 0.11 |
| Pure extractive (TF-IDF/KMeans) | 0.18 |
| Hybrid pipeline (ours) | 0.33 |
05. What I Learned & Open Questions
The extractive stage's cluster count k is the most sensitive hyperparameter — too few clusters and key content is dropped; too many and the abstractive model receives near-full-length input, losing the compression benefit.
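One simple way to pin down that trade-off — an assumption about how `k` could be chosen, not the project's actual rule — is to derive it from a target retention ratio. The ~70% length reduction quoted above implies keeping ~30% of sentences:

```python
def choose_k(n_sentences, keep_ratio=0.30, k_min=2):
    """Cluster count targeting a fixed retention ratio: keep_ratio
    near 1.0 feeds BART near-full-length input (no compression);
    too-small k drops key content. k_min guards short documents."""
    return max(k_min, round(n_sentences * keep_ratio))
```

For a 20-sentence document this yields k = 6; a 3-sentence document is clamped to the floor of 2, since clustering fewer than two sentences is degenerate.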
ROUGE scores are necessary but not sufficient — in human evaluation, extractive-constrained summaries were significantly preferred for faithfulness even against outputs with ROUGE scores similar to the pure abstractive baseline's.
Open question: Can the extractive stage be replaced by a lightweight cross-encoder ranker trained on faithfulness signals rather than TF-IDF similarity?