Python Forum
Reading All The RAW Data Inside a PDF
#1
Hi, can anyone suggest code that will return all the raw data in a PDF, including any special tags/markup applied to the text?

Appreciate you all.

-Jim
Reply
#2
I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2
Sig:
>>> import this

The UNIX philosophy: "Do one thing, and do it well."

"The danger of computers becoming like humans is not as great as the danger of humans becoming like computers." :~ Konrad Zuse

"Everything should be made as simple as possible, but not simpler." :~ Albert Einstein
Reply
#3
(Nov-30-2022, 06:58 PM)rob101 Wrote: I've been looking at some of the PDF libraries myself, and from what I know so far, I'd suggest you take a look at PyPDF2

Yes, I tried this one already, and when I used:

import PyPDF2

# Assign file
file_name = "STRIVE December Schedule -A.pdf"

# Legacy camelCase API (PyPDF2 before 3.0)
doc = PyPDF2.PdfFileReader(file_name)

# Number of pages
pages = doc.getNumPages()

# Loop over page indexes; the reader object itself is not iterable
for i in range(pages):
    current_page = doc.getPage(i)
    text = current_page.extractText()

    print(text)
The text returned was the "readable" text from the PDF. What I want is a level BELOW that, where I can see the raw markup/tags applied to all the text.
Larz60+ wrote Nov-30-2022, 10:55 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time. Please use BBCode tags on future posts.
Reply
#4
Ah, okay. Well the only other one I've used is pdfrw 0.4

I've not used it for what you're trying to do, but you may find something there that will work for you.
Reply
#5
If you really want to get down to the nitty-gritty, see: https://opensource.adobe.com/dc-acrobat-...arted.html
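
For anyone who wants to see that level below the readable text without installing anything, here is a minimal sketch using only the standard library. It is not taken from any of the libraries mentioned in this thread. It assumes the PDF is not encrypted and that its streams are either stored as-is or compressed with the common FlateDecode (zlib) filter; other filters such as LZW or ASCII85 are not handled. The names iter_raw_streams and dump_raw_streams are made up for this example.

```python
import re
import zlib

# Bytes between the "stream" and "endstream" keywords of a PDF object.
# The lookbehind keeps the tail of "endstream" from being read as "stream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def iter_raw_streams(pdf_bytes):
    """Yield every stream in the file, inflated when it is zlib data."""
    for match in STREAM_RE.finditer(pdf_bytes):
        data = match.group(1)
        try:
            # decompressobj() ignores the end-of-line bytes that may
            # follow the compressed data just before "endstream".
            yield zlib.decompressobj().decompress(data)
        except zlib.error:
            # Not zlib data (uncompressed or another filter): yield as stored.
            yield data


def dump_raw_streams(path):
    """Print each stream of the PDF at `path` with a numbered header."""
    with open(path, "rb") as handle:
        pdf_bytes = handle.read()
    for number, stream in enumerate(iter_raw_streams(pdf_bytes), start=1):
        print(f"--- stream {number} ---")
        # latin-1 maps every byte to a character, so decoding never fails.
        print(stream.decode("latin-1"))
```

Calling dump_raw_streams("STRIVE December Schedule -A.pdf") would print every stream in the file. Page content streams are the ones containing operators such as BT, Tf, Tj and ET; fonts, images and metadata show up as streams too. The text operands may be font-specific glyph codes rather than readable characters, which is the part the extraction libraries decode for you.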
Reply

