How to properly format rows and columns in excel data from parsed .txt blocks

jh67 · Dec-09-2022, 08:00 PM

Hi. I'm trying to create an excel sheet using data made up from specific, recurring blocks of text in a .txt file and ignoring all other data in the file (table of contents, etc). The blocks of text are in the format shown here and I'm having two different issues with the output in the final excel sheet.

Quote:ItemName
Status: StatusItem
Category: CategoryItem
Sub-Category: SubcategoryItem
Date: MM/dd/yyyy
Frequency: FrequenceItem
Language: LanguageItem
Description: DescriptionText

Currently, my code is using a dictionary made by searching for a block pattern, inserting it into a data-frame (pandas) and exporting to excel. The issue I'm having, however, is that it creates a new row for each line in the text block rather than creating a new row for each block. Shown in the attached image.

How can I prevent this with my block pattern?

The second issue is that the top item ("ITEMNAME") is not in the same format as it doesn't have a "key" to use in the dictionary. How would I include it in the block pattern to add to the dictionary and export being in a different format?

Code:

import pandas as pd
import re

txtFilePath = r'/test.txt'

with open(txtFilePath, 'r') as f:
    text = f.read()

# define pattern for block search - first method
block_pattern = re.compile(r"^(?:Status|Event Category|Event Sub-Category|"
                           r"Added Date|Update Frequency|Language|Description):.+$",
                           re.MULTILINE)


# find all blocks that match the pattern
blocks = block_pattern.findall(text)

df = pd.DataFrame()
# columns=["Status", "Category", "Sub-Category", "Date", "Frequency", "Language", "Description"]

# iterate over blocks and extract patterns
for block in blocks:
    # split blocks on individual lines
    lines = block.strip().split("\n")
    # dict to store items for this block
    item_dict = {}

    # iterate to find type and values
    for line in lines:
        item_parts = line.strip().split(":", 1)
        item_type = item_parts[0].strip()
        item_value = item_parts[1].strip()

        # add item to dict
        item_dict[item_type] = item_value

    df = df.append(item_dict, ignore_index=True)

# print(df)
df.to_excel(r'/text.xlsx')

I've tried two changes to the append line to fix the extra rows.

First change:

df.loc[0] = df.append(item_dict, ignore_index=True)

but it gives this error:
>

Quote:ValueError: cannot set row with mismatched columns

Second Change:

df.loc[0] = item_dict

This change creates the columns but imports no data from the dictionary. The excel sheet has columns but nothing else.

I've been scouring the internet for solutions and have really had no luck. Thanks in advance for any assistance.

Pedroski55 · Dec-09-2022, 10:43 PM

Quote:How would I include it in the block pattern to add to the dictionary

Just add a entry to your dictionary: key = "ItemName", value = "Terminator"

Quote:and export being in a different format?

openpyxl has formatting options for text and cells just check that out.

I don't know much about pandas, but I don't think it does formatting!

Post a few lines of your text data and an example of the desired output, then things will be clearer.

jh67 · Dec-10-2022, 06:38 PM

One sample block. It's worth noting that there are over 400 pages of these blocks in the text file but after the table of contents and a brief appendix at the end, these blocks are the only data in there besides footers. some blocks have empty values in fields but they always have the key at least.

Institution 123
Status: Production
Event Category: News
Event Sub-Category: N/A
Added Date: 5/10/2017
Update Frequency: Weekly
Language: English
Description: Local financial institution

**deanhystad** · (This post was last modified: Dec-11-2022, 01:54 AM by deanhystad.)

I think your pattern is wrong for finding blocks. I think it should look like this. Note that I left out some of the fields to shorten this post.

import re
import pandas as pd

pattern = re.compile(
    r"^(.*)\n"
    r"^Status: (.*)\n"
    r"^Category: (.*)\n"
    r"^Description: (.*)\n",
    flags=re.MULTILINE)

with open("data.txt", "r") as file:
    text = "".join(file)

columns = ["Item", "Status", "Category", "Description"]
print(pd.DataFrame.from_records(re.findall(pattern, text), columns=columns))

I made a dummy file with some valid and invalid blocks and extra fluff to ignore.

Output:ItemName 1
Status: Status Item 1
Category: Category Item 1
Description: Description Text 1
extra
stuff
ItemName 2
Category: Order is wrong
Status: Status Item 2
Description: Description Text 2
extra
stuff
ItemName 3
Status: Status Item 3
Category: Category Item 3
Sub-Category: Extra field
Description: Description Text 3
extra
stuff
ItemName 4
Statis: Spelling error
Category: Category Item 4
Description: Description Text 4
extra
stuff
ItemName 5
Status: Status Item 5
Category: Category Item 5
Description: Description Text 5
extra
stuff

When I run the program it finds the two valid blocks.

Output:         Item         Status         Category         Description
0  ItemName 1  Status Item 1  Category Item 1  Description Text 1
1  ItemName 5  Status Item 5  Category Item 5  Description Text 5

jh67 · Dec-11-2022, 12:11 AM

Holy cow, this worked brilliantly. Thank you so much. It exported to excel flawlessly.

I do have one further question. I should have detailed it in my original post but the last field (Description) is longer than the rest of the fields in some cases and it looks like it's cutting it off at the line break when the field is too long. How can I cut the description field off at the end of the paragraph? Example:

Institution 123
Status: Production
Event Category: News
Event Sub-Category: N/A
Added Date: 5/10/2017
Update Frequency: Weekly
Language: English
Description: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco
laboris nisi ut aliquip ex ea commodo consequat.

Institution 123
Status: Production
Event Category: News
Event Sub-Category: N/A
Added Date: 5/10/2017
Update Frequency: Weekly
Language: English
Description: Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum.

**deanhystad** · (This post was last modified: Dec-11-2022, 04:21 AM by deanhystad.)

Is there always a blank line following the description? How can you have a blank line be part of the pattern? I don't know. This might have to be a multistep process.

jh67 · Dec-12-2022, 01:47 PM

Yes, there's always a blank line after the description block, regardless of if there is anything in the description field. Also, there is always a period at the end of the description field when there is actually text there. I thought about splitting on a period for that block but not sure how it would handle empty description blocks.

jh67 · (This post was last modified: Dec-12-2022, 08:23 PM by jh67.)

I've tried to add an additional blank line to the end of the block description field in the pattern like this but when I add it, the excel file ends up being empty so I assume its because this pattern isn't recognized. There's always a blank line so I'm likely just formatting this extra empty line wrong.

block_pattern = re.compile(
    r"^(.*)\n"
    r"^Status: (.*)\n"
    r"^Event Category: (.*)\n"
    r"^Event Sub-Category: (.*)\n"
    r"^Added Date: (.*)\n"
    r"^Update Frequency: (.*)\n"
    r"^Language: (.*)\n"
    r"^Description: (.*)\n\*s\n",
    flags=re.MULTILINE)

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Excel isnt working properly after python function is started	IchNar	2	360	May-01-2024, 06:43 PM Last Post: IchNar
	RSA Cipher with blocks	Paragoon2	0	547	Nov-26-2023, 04:35 PM Last Post: Paragoon2
	Export data from PDF as tabular format	zinho	5	795	Nov-11-2023, 08:23 AM Last Post: Pedroski55
	how do you style data frame that has empty rows.	gsaray101	0	555	Sep-08-2023, 05:20 PM Last Post: gsaray101
	Copy data from Excel and paste into Discord (Midjourney)	Joe_Wright	4	2,194	Jun-06-2023, 05:49 PM Last Post: rajeshgk
	Reading data from excel file –> process it >>then write to another excel output file	Jennifer_Jone	0	1,167	Mar-14-2023, 07:59 PM Last Post: Jennifer_Jone
	Converting a json file to a dataframe with rows and columns	eyavuz21	13	4,852	Jan-29-2023, 03:59 PM Last Post: eyavuz21
	Comparing two columns with same value but different font format	doug2019	1	764	Jan-08-2023, 02:58 PM Last Post: Larz60+
	(Python) Pulling data from UA Google Analytics with more than 100k rows into csv.	Stockers	0	1,285	Dec-19-2022, 11:11 PM Last Post: Stockers
	Extracting Data into Columns using pdfplumber	arvin	17	5,875	Dec-17-2022, 11:59 AM Last Post: arvin

How to properly format rows and columns in excel data from parsed .txt blocks

User Panel Messages

Announcements