How to compute conditional unigram probabilities?

jbond · Jan-25-2020, 11:34 AM

Hello,

i have difficulties with my homework (Task 4).
I don't know how to do this.
I have already an attempt but I think it is wrong and I don't know how to go on.
The task gives me pseudocode as a hint but I can't make code from it.
I know it's not that hard and ist only a few lines, but I have no Idea what to do.

my code for task 1, 2, 3 and my attempt for task 4:

 
import re

class Ngram:
    filename = ""
    n = 0
    raw_counts = {}
    prob = {}
    cond_prob = {}
    
    # Task 1
    def __init__(self, filename="", n=0):
        self.filename = filename
        self.n = n

    # Task 2
    def extract_raw_counts(self):
        fp = open(self.filename, 'r') 
        allLines = fp.readlines()
        for line in allLines: 
            tokenLst = tokenize_smart(line.rstrip("\r\n"))
            for i in range(0,self.n-1):
                tokenLst.insert(0,"BOS")
                tokenLst.append("EOS")
            for i in range(len(tokenLst)-self.n):
                newTuple = tuple(tokenLst[i:i+self.n])
                if newTuple in self.raw_counts:
                    self.raw_counts[newTuple] += 1
                else:
                    self.raw_counts[newTuple] = 1
        
    # Task 3
    def extract_probabilities(self):
        sumRawCounts = sum(self.raw_counts.values()) + len(self.raw_counts)
        for key in self.raw_counts:
            self.prob[key] = self.raw_counts[key] / sumRawCounts

    # Task 4
    def extract_conditional_probabilities(self):
        #my attemt for task 4
        for key in self.prob:
            mgram = key[0:self.n-1]
            unigram = key[self.n]
            if not mgram in self.prob:
                self.prob[mgram] = {}
            else:
                self.cond_prob[mgram] = unigram
            
        
        pass

    # Task 5
    def generate_random_token(self, mgram):
        """
        Generate a random next token based on an n-1 gram,
        taking into account the probability distribution over the possible next tokens for that n-1-gram.

        :param mgram: the n-1 gram to generate the next token for.
        :type mgram: a tuple (of length n-1) of strings.
        :return a random next token for the n-1-gram.
        :rtype str
        """
        pass

    # Task 6
    def generate_random_sentence(self):
        """
        Generate a random sentence.

        :return a random sentence
        :rtype list[str]
        """
        pass


def tokenize_smart(sentence):
    """
    Tokenize the sentence into tokens (words, punctuation).

    :param sentence: the sentence to be tokenized
    :type sentence: str
    :return: list of tokens in the sentence
    :rtype: list[str]
    """
    tokens = []
    for word in re.sub(r" +", " ", sentence).split():
        word = re.sub(r"[\"„”“»«`\(\)]", "", word)
        if word != "":
            if word[-1] in ".,!?;:":
                if len(word) == 1:
                    tokens += [word]
                else:
                    tokens += [word[:-1], word[-1]]
            else:
                tokens.append(word)

    return tokens


def list2str(sentence):
    """
    Convert a sentence given as a list of strings to the sentence as a string separated by whitespace.
    
    :param sentence: the string list to be joined
    :type sentence: list[str]
    :return: sentence as a string, separated by whitespace
    :rtype: str
    """
    sentence = " ".join(sentence)
    sentence = re.sub(r" ([\.,!\?;:])", r"\1", sentence)
    return sentence


if __name__ == '__main__':
    
    # Task 1
    print("Task 1:")
    ngram_model = Ngram("de-sentences-tatoeba.txt", 2)
    print(ngram_model.n, ngram_model.filename)
    print(ngram_model.raw_counts, ngram_model.prob, ngram_model.cond_prob)
    
    # Task 2
    print("\nTask 2:")
    ngram_model.extract_raw_counts()
    print(ngram_model.raw_counts[("kaltes", "Land")])
    print(ngram_model.raw_counts[("schönes", "Land")])
    
    # Task 3
    print("\nTask 3:")
    ngram_model.extract_probabilities()
    print(ngram_model.prob[("kaltes", "Land")])
    print(ngram_model.prob[("schönes", "Land")])
    
    '''
    # Task 4
    ngram_model.extract_conditional_probabilities()
    print(ngram_model.cond_prob[(" beobachteten ",)])
    print(ngram_model.cond_prob[("schönes",)][("Land")])
    # Task 5
    print(ngram_model.generate_random_token(("den",)))
    print(ngram_model.generate_random_token(("den",)))
    print(ngram_model.generate_random_token(("den",)))
    # Task 6
    print(list2str(ngram_model.generate_random_sentence()))
    print(list2str(ngram_model.generate_random_sentence()))
    '''

Task 1 and 2:
[Image: c8kWdT3]

Task 3 and 4: (Here I get stucked)
[Image: nnd3pD4]

Can someone help me or give me a hint?

Thank you in advance

**buran** · Jan-25-2020, 02:22 PM

cross-post on SO

jbond · Jan-25-2020, 02:58 PM

(Jan-25-2020, 02:22 PM)buran Wrote: cross-post on SO

Hi buran,

can you delete this thread?
Im not allowed to post my code public in the internet because other students of my class could copy this.
Then I would get 0 points for this homework.

Thank you.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Maths and python: Compute stress level	cheerful	1	2,762	Oct-20-2021, 10:05 AM Last Post: Larz60+
	Compute complex solutions in quadratic equations	liam	1	1,921	Feb-09-2020, 04:18 PM Last Post: Gribouillis
	To extract a specific column from csv file and compute the average	vicson	2	8,158	Oct-20-2018, 03:18 AM Last Post: vicson
	Write a program to compute the sum of the terms of the series: 4 - 8 + 12 - 16 + 20 -	chewey777	0	2,853	Mar-24-2018, 12:39 AM Last Post: chewey777
	How do you compute tf-idf from a list without using the counter class	syntaxkiller	8	5,317	Dec-01-2017, 05:24 PM Last Post: nilamo
	compute gross pay	jamesuzo	1	10,369	Sep-07-2017, 01:47 PM Last Post: ichabod801

How to compute conditional unigram probabilities?

User Panel Messages

Announcements