Python Forum
Add NER output to pandas dataframe
#1
I'm trying to write a script that scans a document and adds the extracted named entities to a pandas dataframe. Currently it just prints them and exits. The goal is to associate named entities with the documents where they're mentioned. Where I'm truly lost is how to turn the NER output into separate rows. It should be easy, since the output is just a list with fairly clear separations, but I'm not sure whether there's a particular library I should use, or exactly how to get these two parts to "talk" to each other.

My working code is below. I also have a non-working version where I tried to add the dataframe functionality, but it's so far from correct that it isn't worth including.
#!/usr/bin/env python3

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the pretrained NER model and its tokenizer, then wrap them in a pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
ner_cls = pipeline("ner", model=model, tokenizer=tokenizer)


x = input('Enter document: ')

# Read the whole file; the context manager closes it afterwards
with open(x, "r") as f:
    document = f.read()
ner_results = ner_cls(document)

organized_results = {'LOC': [], 'PER': [], 'ORG': [], 'MISC': []}

current_entity = None
current_words = []

for result in ner_results:
    entity_type = result['entity'].split('-')[1]
    if result['entity'].startswith('B-'):
        if current_entity:
            organized_results[current_entity].append(' '.join(current_words))
        current_entity = entity_type
        current_words = [result['word']]
    elif result['entity'].startswith('I-') and current_entity == entity_type:
        current_words.append(result['word'])

# Handle the last entity
if current_entity:
    organized_results[current_entity].append(' '.join(current_words))

# Rejoin WordPiece fragments: "Wash ##ington" becomes "Washington"
for key, value in organized_results.items():
    organized_results[key] = [word.replace(' ##', '').replace('##', '') for word in value]

print(organized_results)
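
One possible way to get from the dictionary above to separate dataframe rows is to flatten it into one record per entity mention and hand the records to pandas. This is only a sketch: the helper names `results_to_rows` and `rows_to_dataframe` and the column names `document`, `entity_type` and `entity` are made up for illustration, and it assumes `organized_results` has the shape built in the script (entity type mapped to a list of strings).

```python
def results_to_rows(organized_results, document_name):
    """Flatten {'PER': [...], 'ORG': [...]} into one record per entity mention.

    Hypothetical helper: each record remembers which document the entity
    came from, so results from several documents can share one table.
    """
    rows = []
    for entity_type, entities in organized_results.items():
        for entity in entities:
            rows.append(
                {
                    "document": document_name,
                    "entity_type": entity_type,
                    "entity": entity,
                }
            )
    return rows


def rows_to_dataframe(rows):
    """Build a DataFrame from the records (hypothetical helper).

    pandas is imported here so the flattening step above has no
    dependency on it.
    """
    import pandas as pd

    # Passing columns keeps the layout stable even when rows is empty.
    return pd.DataFrame(rows, columns=["document", "entity_type", "entity"])
```

At the end of the script, something like `df = rows_to_dataframe(results_to_rows(organized_results, x))` would give one row per entity, tagged with the file name entered at the prompt. Collecting the rows from several documents in one list before building the dataframe keeps everything in a single table.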