Tutorial 009: How to Extract & Summarize Text from PDFs & Scans

Automate PDF and image text extraction with 0CodeKit. Learn how to create concise summaries from the extracted text for improved productivity and data management.

Published
December 18, 2024

In the past, students had to go to a library, read thousands of pages on their research topic, and condense their findings into a few paragraphs. Professionals faced a similar task: to prepare a presentation for colleagues or a customer, they had to track down several reports and portfolios, read through all of them, and pull out the most important information. In this blog, we tackle this time-consuming activity by sharing an automation that summarizes the content for you.

More importantly, we've also adapted it for those who still write everything down by hand but would like a digital copy, and for those who work with PDFs but can't copy and paste the text because the document doesn't allow it. For these types of situations, we've created an automation that extracts the text from PDFs, scans, and even images, summarizes it, and stores the summary in Dropbox or Google Drive. Here's a step-by-step guide on how to build this automation.

Not a fan of reading? No problem! Check out our quick, easy-to-follow video tutorial to learn everything you need!

Setting Up the Automation

First, we sign up for or log into an automation platform where the 0CodeKit features are available (Make, Zapier, or n8n).

After that, we set up the first Dropbox module and choose the feature called "Watch Files", which watches a specified folder and triggers whenever a file is uploaded. Next, we add a second Dropbox/Google Drive module with the feature "Download File" so that 0CodeKit can access the document. To set it up, we only need to specify which file we would like to download.

Once the Dropbox/Google Drive module has been set up, we add the 0CodeKit app and find the feature "Create temporary URL to file", which lets 0CodeKit access the document via a URL. Here, we only have to select the option "Dropbox/Google Drive - Download a File".

For the second part of the automation, we add another 0CodeKit module with the feature "PDF OCR", which extracts the text if the document uploaded to Dropbox is a PDF. To set it up, we only need to enter the "Temporary File URL" icon into the "PDF URL" field.

Then, we need to add another 0CodeKit module with the feature "Rephrase Too Long to Read Text", which will take the recognized text from the PDF and summarize it. Here, we only need to enter the "Recognized Text" icon into the "Input text" field.

Afterwards, we add one last 0CodeKit module with the feature "Markdown String to PDF", which converts the summarized text into a PDF, since the output of the "Rephrase Too Long to Read Text" feature is plain text. In the "Markdown String" field, we can give the output some structure by adding a heading with the summarized text below it. Note: The "CSS" field can be left empty.
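To make the "Markdown String" idea concrete, here is a minimal Python sketch of how a heading and a summary could be combined into one Markdown string. The function name `build_markdown_string` and the sample heading are hypothetical and not part of 0CodeKit; in the actual module, you would type the heading and map the summary output directly into the field.

```python
def build_markdown_string(heading: str, summary: str) -> str:
    """Combine a heading and a summary into one Markdown document (illustrative only)."""
    # A level-1 Markdown heading, a blank line, then the summary body.
    return f"# {heading.strip()}\n\n{summary.strip()}\n"


if __name__ == "__main__":
    # Sample values, made up for this example.
    print(build_markdown_string("Summary", "The report covers the third quarter."))
```

The blank line between the heading and the body matters in Markdown, since it separates the heading from the paragraph that follows.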

Finally, we add one Dropbox module with the feature "Upload a File", which uploads the summarized data back into Dropbox. To set it up, we need to tell the module which folder to store the file in, give the document a name, and specify which data the document will contain.
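One possible way to name the uploaded document is to derive it from the original file name. The Python sketch below shows that idea under our own assumptions: the helper `summary_path`, the `-summary` suffix, and the `/Summaries` folder are all hypothetical choices, not something the Dropbox module requires.

```python
from pathlib import PurePosixPath


def summary_path(folder: str, original_name: str) -> str:
    """Build a target path for the summary PDF from the original file name."""
    # Drop the original extension (.pdf, .png, ...) and append a suffix,
    # because the uploaded summary is always a PDF.
    stem = PurePosixPath(original_name).stem
    return str(PurePosixPath(folder) / f"{stem}-summary.pdf")


if __name__ == "__main__":
    print(summary_path("/Summaries", "meeting-notes.png"))
```

Keeping the original name in the summary's file name makes it easier to match each summary to its source document later.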

But what happens if the document is not a PDF but an image or a scan instead? To handle this case, we add a router so that we can attach a module that extracts and summarizes the text from images or scans.
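The decision a router makes can be pictured as a simple check on the file extension. The following Python sketch is only an illustration of that logic: the function `choose_branch`, the branch labels, and the list of image extensions are assumptions of ours, and the formats 0CodeKit actually accepts may differ.

```python
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to the formats you actually use.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def choose_branch(file_name: str) -> str:
    """Return which OCR branch a file should take, based on its extension."""
    extension = PurePosixPath(file_name).suffix.lower()
    if extension == ".pdf":
        return "pdf_ocr"      # route to the "PDF OCR" module
    if extension in IMAGE_EXTENSIONS:
        return "image_ocr"    # route to "Detect Text in a Picture with OCR"
    return "unsupported"      # neither branch applies


if __name__ == "__main__":
    for name in ("Report.PDF", "scan.jpeg", "notes.txt"):
        print(name, "->", choose_branch(name))
```

In the automation platform, the same check would typically be expressed as a filter condition on each route rather than as code.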

For the third part of the automation, we need to add the 0CodeKit module with the feature "Detect Text in a Picture with OCR" in order for the automation to extract text from images and scans as well. Here, we only need to enter the "Temporary File URL" icon into the "Image URL" field.

Finally, we set up the same 0CodeKit modules from the second part of the automation once more. We need the "Rephrase Too Long to Read Text" module with the "Texts in Picture" icon, the "Markdown String to PDF" module with the heading and data that we want to turn into a PDF, and the Dropbox "Upload a File" module to upload the summarized data back into Dropbox.
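To recap the order of the steps, here is a rough Python sketch of the whole flow. It does not call 0CodeKit or Dropbox: every step is passed in as a placeholder function, and all names (`summarize_document`, `pdf_ocr`, `image_ocr`, `summarize`, `to_pdf`, `upload`) are hypothetical stand-ins for the modules described above.

```python
from typing import Callable


def summarize_document(
    file_url: str,
    is_pdf: bool,
    pdf_ocr: Callable[[str], str],      # stand-in for "PDF OCR"
    image_ocr: Callable[[str], str],    # stand-in for "Detect Text in a Picture with OCR"
    summarize: Callable[[str], str],    # stand-in for "Rephrase Too Long to Read Text"
    to_pdf: Callable[[str], bytes],     # stand-in for "Markdown String to PDF"
    upload: Callable[[bytes], None],    # stand-in for Dropbox "Upload a File"
) -> None:
    """Run the extract, summarize, convert, upload sequence for one file."""
    # Router: pick the OCR branch that matches the file type.
    text = pdf_ocr(file_url) if is_pdf else image_ocr(file_url)
    # Both branches then follow the same three steps.
    summary = summarize(text)
    markdown = f"# Summary\n\n{summary}"
    upload(to_pdf(markdown))


if __name__ == "__main__":
    # Fake steps so the sketch runs without any external service.
    summarize_document(
        "https://example.com/temp-file",
        is_pdf=True,
        pdf_ocr=lambda url: "text recognized from the PDF",
        image_ocr=lambda url: "text recognized from the image",
        summarize=lambda text: text[:20],
        to_pdf=lambda md: md.encode("utf-8"),
        upload=lambda data: print(f"uploaded {len(data)} bytes"),
    )
```

The sketch shows why the second and third parts of the automation look so similar: only the OCR step differs between the two routes.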

In Closing

If you would like to know how to use other 0CodeKit features, head to our YouTube channel for more tutorials.
