Using Llama2 & Spark

Overview

This exercise uses Llama 2, a large language model (LLM) from Meta AI, to summarise many documents at once. Spark distributes the work so that the documents are processed in parallel.

Note: This is not my own work. I followed this article: https://towardsdatascience.com/distributed-llama-2-on-cpus-via-llama-cpp-pyspark-65736e9f466d

1. Create a virtual env

python -m venv llama_spark_venv
source llama_spark_venv/bin/activate

2. Download the model

This step downloads the Llama 2 7B Chat model, already converted to GGML format. Other model files are available on the same page: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML.

mkdir llama_spark
cd llama_spark/

mkdir models
cd models/

wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q8_0.bin
ls -lrt
-rw-r--r--  1 rk  staff  7160799872 19 Jul 03:50 llama-2-7b-chat.ggmlv3.q8_0.bin
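
A quick way to confirm that the download completed is to compare the size on disk with the size in the listing above. The following is a minimal sketch in Python; the helper name `model_file_ok` is hypothetical, and the expected size is simply the byte count reported by `ls`, not an officially published checksum.

```python
import os

# Byte count taken from the `ls -lrt` listing above (an assumption about
# what a complete download looks like, not an official checksum).
EXPECTED_SIZE = 7160799872


def model_file_ok(path, expected_size=EXPECTED_SIZE):
    """Return True if `path` is a regular file of exactly `expected_size` bytes."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_size
```

From inside the models directory, `model_file_ok("llama-2-7b-chat.ggmlv3.q8_0.bin")` should return True once the download has finished.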

3. Install llama-cpp Python bindings

pip install llama-cpp-python
Collecting llama-cpp-python
  Using cached llama_cpp_python-0.1.77.tar.gz (1.6 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0
  Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting numpy>=1.20.0
  Using cached numpy-1.25.2-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
Collecting diskcache>=5.6.1
  Using cached diskcache-5.6.1-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.77-cp311-cp311-macosx_13_0_arm64.whl size=236114 sha256=3054fe6a05eecdae80e979077f8e4578ec2bf6102d089cd1eb81f503f0239e33
  Stored in directory: /Users/rk/Library/Caches/pip/wheels/a2/ea/0a/19ffc6aaf5c35243864ffca3f6bb4c971bdaad17fb863f9b9a
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, numpy, diskcache, llama-cpp-python
Successfully installed diskcache-5.6.1 llama-cpp-python-0.1.77 numpy-1.25.2 typing-extensions-4.7.1

Testing

from llama_cpp import Llama
llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin")

output = llm("Q: Name the planets in the solar system? A: ", max_tokens=400, stop=["Q:", "\n"], echo=True)
print(output)
{'id': 'cmpl-1fd69252-bc63-483e-b1ff-75897054d72d', 'object': 'text_completion', 'created': 1691060842, 'model': './llama-2-7b-chat.ggmlv3.q8_0.bin', 'choices': [{'text': 'Q: Name the planets in the solar system? A: 1. Pluto is no longer considered a planet, but it is still listed as a dwarf planet. 2. Mercury - closest planet to', 'index': 0, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 32, 'total_tokens': 47}}
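
The call returns a dictionary, and because `echo=True` was passed, the `text` field starts with the prompt itself. The sketch below shows one way to pull out only the generated part. The helper name `completion_text` is hypothetical, and the sample dictionary is an abbreviated copy of the response printed above.

```python
def completion_text(output, prompt=""):
    """Return the generated text of the first choice, without the echoed prompt."""
    text = output["choices"][0]["text"]
    # With echo=True the returned text begins with the prompt itself.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Abbreviated copy of the response printed above.
sample = {
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: 1. Pluto is no longer "
            "considered a planet, but it is still listed as a dwarf planet. "
            "2. Mercury - closest planet to",
            "finish_reason": "length",
        }
    ]
}

print(completion_text(sample, "Q: Name the planets in the solar system? A: "))
```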

4. Download the text that will be summarised

mkdir data
cd data/
curl "https://gutenberg.org/cache/epub/2600/pg2600.txt" -o war_and_peace.txt
ls -l war_and_peace.txt
-rw-r--r--  1 rk  staff  3359834  3 Aug 21:12 war_and_peace.txt
# print lines, words, characters
echo "$(cat ./war_and_peace.txt | wc -l) lines"
echo "$(cat ./war_and_peace.txt | wc -w) words"
echo "$(cat ./war_and_peace.txt | wc -c) characters"
66081 lines
566325 words
3359834 characters
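
The same three counts can be reproduced in Python, which is handy if `wc` is not available. This is a minimal sketch; the function name `text_stats` is hypothetical. It works on raw bytes because `wc -c` counts bytes rather than characters, so the last figure matches the file size shown by `ls`.

```python
def text_stats(data):
    """Return (lines, words, bytes) for a bytes object, similar to wc -l, -w and -c."""
    lines = data.count(b"\n")   # wc -l counts newline characters
    words = len(data.split())   # runs of non-whitespace separated by ASCII whitespace
    size = len(data)            # wc -c counts bytes
    return lines, words, size
```

For the downloaded book this could be called as `text_stats(open("war_and_peace.txt", "rb").read())`. Word counts may differ slightly from `wc -w` depending on the locale `wc` runs in.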

5. Install PySpark

pip install pyspark
pip install pandas

Project layout, listed from the llama_spark directory:

tree
.
├── data
│   └── war_and_peace.txt
├── models
│   └── llama-2-7b-chat.ggmlv3.q8_0.bin
└── process.py

2 directories, 4 files

process.py

import re  
  
import pandas as pd  
from pyspark.sql import SparkSession  
  
  
# this is the function applied per-group by Spark  
# the df passed is a *Pandas* dataframe!  
def llama2_summarize(df):  
    # read model  
    from llama_cpp import Llama  
  
    # template for this model version, see:  
    # https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML#prompt-template-llama-2-chat    template = """  
    [INST] <<SYS>>    You are a helpful, respectful and honest assistant.    Always answer as helpfully as possible, while being safe.    
    Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.   
    Please ensure that your responses are socially unbiased and positive in nature.  
    If a question does not make any sense, or is not factually coherent, explain why instead of answering    something not correct.   
    If you don't know the answer to a question, please don't share false information.  
    <</SYS>>    {INSERT_PROMPT_HERE} [/INST]    """  
    # create prompt  
    chapter_text = df.iloc[0]["text"]  
    chapter_num = df.iloc[0]["chapter"]  
    prompt = (  
        "Summarize the following novel chapter in a single sentence (less than 100 words):"  
        + chapter_text  
    )  
    # replace the whole placeholder, braces included
    prompt = template.replace("{INSERT_PROMPT_HERE}", prompt)
  
    print("Going to invoke llm()")  
    llm = Llama(  
        model_path="./models/llama-2-7b-chat.ggmlv3.q8_0.bin",  
        n_ctx=4096,  
        n_batch=512,  
        n_threads=8,  
        verbose=True,  
    )  
  
    # echo=False so that the returned text holds only the summary, not the prompt
    output = llm(prompt, max_tokens=-1, echo=False, temperature=0.2, top_p=0.1)
    print(output)  
  
    return pd.DataFrame(  
        {"summary": [output["choices"][0]["text"]], "chapter": [int(chapter_num)]}  
    )  
  
  
spark = SparkSession.builder.appName("my-spark-app").getOrCreate()  
  
# read book, remove header/footer  
with open("./data/war_and_peace.txt", "r", encoding="utf-8") as f:
    text = f.read()
text = text.split("PROJECT GUTENBERG EBOOK WAR AND PEACE")[1]  
  
# get list of chapter strings  
chapter_list = [x for x in re.split("CHAPTER .+", text) if len(x) > 100]  
  
# print stats  
print("number of chapters = " + str(len(chapter_list)))  
print("max words per chapter = " + str(max([len(c.split(" ")) for c in chapter_list])))  
  
# create Spark dataframe, show it  
df = spark.createDataFrame(  
    pd.DataFrame({"text": chapter_list, "chapter": range(1, len(chapter_list) + 1)})  
)  
  
df.show(10, 60)  
  
# Test with 1 row  
pandas_df = df.limit(1).toPandas()  
resp = llama2_summarize(pandas_df)  
print(resp)  
  
# create summaries via Spark  
# show() returns None, so keep the DataFrame and display it separately
summaries = df.groupby("chapter").applyInPandas(
    llama2_summarize, schema="summary string, chapter int"
)
summaries.show(vertical=True, truncate=False)

Run the script:

python process.py
23/08/04 19:43:26 WARN Utils: Your hostname, RKs-Mac-mini.local resolves to a loopback address: 127.0.0.1; using 192.168.0.20 instead (on interface en1)
23/08/04 19:43:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/08/04 19:43:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
number of chapters = 365
max words per chapter = 3636
+------------------------------------------------------------+-------+
|                                                        text|chapter|
+------------------------------------------------------------+-------+
|\n\n“Well, Prince, so Genoa and Lucca are now just family...|      1|
|\n\nAnna Pávlovna’s drawing room was gradually filling. T...|      2|
|\n\nAnna Pávlovna’s reception was in full swing. The spin...|      3|
|\n\nJust then another visitor entered the drawing room: P...|      4|
|\n\n“And what do you think of this latest comedy, the cor...|      5|
|\n\nHaving thanked Anna Pávlovna for her charming soiree,...|      6|
|\n\nThe rustle of a woman’s dress was heard in the next r...|      7|
|\n\nThe friends were silent. Neither cared to begin talki...|      8|
|\n\nIt was past one o’clock when Pierre left his friend. ...|      9|
|\n\nPrince Vasíli kept the promise he had given to Prince...|     10|
+------------------------------------------------------------+-------+
only showing top 10 rows
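
The chapter count and the table above come from splitting the book on its "CHAPTER ..." heading lines and discarding short fragments. The sketch below isolates that step so it can be tried without Spark or the model. The function name `split_chapters` and the sample text are made up for illustration; the regular expression and the 100-character threshold are the ones used in process.py.

```python
import re


def split_chapters(text, min_chars=100):
    """Split on 'CHAPTER ...' heading lines and keep parts longer than min_chars."""
    # "." does not match a newline, so each match covers one heading line only.
    return [part for part in re.split(r"CHAPTER .+", text) if len(part) > min_chars]


# Tiny made-up sample, not taken from the book.
sample = (
    "BOOK ONE\n\n"
    + "CHAPTER I\n\n"
    + "First chapter text. " * 10
    + "\n\nCHAPTER II\n\n"
    + "Second chapter text. " * 10
    + "\n"
)

chapters = split_chapters(sample)
print("number of chapters = " + str(len(chapters)))
print("max words per chapter = " + str(max(len(c.split()) for c in chapters)))
```

The short "BOOK ONE" fragment before the first heading is dropped by the length filter, which is the same mechanism that removes front matter and section titles in the real book.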

Going to invoke llm()
llama.cpp: loading model from /Volumes/samsung-2tb/rk/llama.cpp/models/llama-2-7b/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 4096
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 1.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4173.96 MB (+ 2048.00 MB per state)
llama_new_context_with_model: kv self size  = 2048.00 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
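
The "kv self size" figure in the log can be checked by hand from the other logged values. The sketch below assumes that the key and value caches are stored as 16-bit floats (2 bytes per element). That assumption is not stated in the log, but it reproduces the reported 2048.00 MB exactly.

```python
# Values copied from the llama_model_load_internal lines above.
n_layer = 32
n_ctx = 4096
n_embd = 4096

# Assumption: keys and values are each stored as 16-bit floats.
bytes_per_element = 2

# One key tensor and one value tensor per layer, each n_ctx x n_embd elements.
kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_element
kv_mb = kv_bytes / (1024 * 1024)
print(f"kv self size = {kv_mb:.2f} MB")
```

Because the size grows linearly with `n_ctx`, halving the context window passed to `Llama(...)` would halve this part of the memory footprint under the same assumption.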

Links

Model

GGML

Python bindings for Llama2 CPP