Saturday, October 29, 2022

[SOLVED] Extracting hashtags and sections in document in ruby

Issue

I have a Markdown text document with several sections; each section's hashtags appear just below its heading. The hashtags are in the form #oneword# or #multiple words hashtag#.

I need to extract the sections and their hashtags in Ruby.

Example

# Section 1

#hash1# #hash tag 2# #hashtag3#

Some text

# Section 2

#hash1# #hash tag 4# #hash tag2#


Some text too

I want to get

{"Section 1"=>["hash1", "hash tag 2", "hashtag3"],
 "Section 2"=>["hash1", "hash tag 4", "hash tag2"]}

Can we get it with grep?


Solution

When faced with a problem such as this, I tend to prefer the builder pattern. It is a little verbose, but it is normally very readable and very flexible.

The main idea is that you have a "reader" that simply scans your input for "tokens" (in this case, lines). When it finds a token that it recognizes, it informs the builder that it found a token of interest. The builder then builds another object based on the input from the "reader". Here is an example of a "DocumentBuilder" that takes input from a "MarkdownReader" and builds the Hash that you are looking for.

class MarkdownReader
    attr_reader :builder

    def initialize(builder)
        @builder = builder
    end

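    def parse(lines)
        lines.each do |line|
            case line
            when /^#[^#]+$/
                # Section heading: a leading '#' and no other '#', e.g. "# Section 1"
                builder.convert_section(line)
            when /^#.+\#$/
                # Hashtag line: starts and ends with '#', e.g. "#hash1# #hash tag 2#"
                builder.convert_hashtag(line)
            end
            # Any other line (plain text, blank lines) is ignored
        end
    end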
end

class DocumentBuilder
    attr_reader :document

    def initialize
        @document = {}
    end

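    def convert_section(line)
        # Drop the leading '#' and any whitespace after it to get the section name
        @section_name = line[/^#\s*(.+)$/, 1]
        document[@section_name] = []
    end

    def convert_hashtag(line)
        # Ignore hashtag lines that appear before the first section heading
        return unless @section_name

        # Splitting on '#' leaves blank pieces between tags; discard them
        hashtags = line.split("#").reject { |tag| tag.strip.empty? }
        document[@section_name] += hashtags
    end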
end

lines = File.readlines("markdown.md")
builder = DocumentBuilder.new 
reader = MarkdownReader.new(builder)
reader.parse(lines)
p builder.document

    => {"Section 1"=>["hash1", "hash tag 2", "hashtag3"], "Section 2"=>["hash1", "hash tag 4", "hash tag2"]}
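
The question also asks whether something grep-like could do the job. As a rough, more compact sketch (not the builder approach above), the same hash can probably be built in a single pass with `String#scan`. The method name `extract_hashtags` and the inline sample text are made up for illustration, and the sample is assumed to mirror the document in the question.

```ruby
# Hypothetical helper: collects the hashtags of each section in one pass.
def extract_hashtags(text)
  sections = {}
  current = nil
  text.each_line do |line|
    line = line.strip
    if line.match?(/\A#[^#]+\z/)
      # Section heading such as "# Section 1"
      current = line.sub(/\A#\s*/, "")
      sections[current] = []
    elsif current && line.match?(/\A#.+#\z/)
      # Hashtag line: scan pulls out the text between each pair of '#'
      sections[current].concat(line.scan(/#([^#]+)#/).flatten)
    end
  end
  sections
end

# Sample input, assumed to match the example in the question
sample = <<~MD
  # Section 1

  #hash1# #hash tag 2# #hashtag3#

  Some text

  # Section 2

  #hash1# #hash tag 4# #hash tag2#

  Some text too
MD

p extract_hashtags(sample)
```

This trades the flexibility of separate reader and builder classes for brevity, so it fits best when the document format is unlikely to grow new kinds of lines.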


Answered By - nPn
Answer Checked By - Timothy Miller (WPSolving Admin)